How can I find common words in two lists?
-
How can I find common words in two (or more) lists? I have several lists of surnames, and I want to compare them to find common names. For example, list one is "Adams, Jones, Smith" and list two is "Allen, Hughes, Smith". I want a tool that can identify "Smith" as the common term. (Of course, the lists I'm dealing with are much longer than these :) )

I've been looking for Mac or online tools that would help me do so, but most of what I've found is geared towards coding and looking for character-by-character differences. Are there any tools or techniques you could recommend for me to compare these texts? The closest thing to what I'm looking for is Compare Suite, which is Windows-only (http://comparesuite.com/online.htm). I'd prefer to simply copy and paste name lists into a comparison tool, instead of keeping them in text files, if possible. This seems like such a basic task and I'm surprised that I haven't come across this type of tool yet. Thanks!
-
Answer:
"comm file1 file2" in the terminal should do it. You can drag your two text files into the terminal window to get their paths.
pantufla at Ask.Metafilter.Com
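As later answers in the thread note, comm needs both input files sorted, and the -1 -2 flags limit the output to lines common to both. A minimal sketch, using hypothetical files built from the question's example lists:

```shell
# Hypothetical one-name-per-line files (comm requires sorted input)
printf '%s\n' Adams Jones Smith > list1.txt
printf '%s\n' Allen Hughes Smith > list2.txt

# -1 and -2 suppress the lines unique to each file,
# leaving only the third column: lines present in both.
comm -12 list1.txt list2.txt
# prints: Smith
```

If the input files aren't already sorted, run them through sort first, or comm's output will be unreliable.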
Other answers
Since they're comma-separated, you could also import them into http://spreadsheets.google.com. At the end of each row, paste the formula =countif(A1:C2, "smith"). Example:

Adams | Jones | Smith | 2
Allen | Hughes | Smith | 1
christopherious
Oops, sorry, I misread a key part of your question there, please disregard my advice. I still think an Excel/GSS formula might be worth looking into, but the above would not yield the answer.
christopherious
Supposing you have two files, 1.txt and 2.txt, containing one surname per line: cat 1.txt 2.txt | sort | uniq -d should do it.
Monday, stony Monday
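A sketch of that pipeline, assuming two hypothetical files 1.txt and 2.txt with one surname per line:

```shell
printf '%s\n' Adams Jones Smith > 1.txt
printf '%s\n' Allen Hughes Smith > 2.txt

# sort groups identical names together; uniq -d then prints only
# the lines that occur more than once in its (sorted) input
cat 1.txt 2.txt | sort | uniq -d
# prints: Smith
```

Note that uniq -d would also report a name that appears twice within a single list, so this assumes no duplicates inside either file.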
This is ugly, but if you know that no name will be repeated in either list, I'd do something like this in Terminal.app:

perl -ne 'chomp; foreach $name (split(/, /)) {$seen{$name}++}; END {foreach $name (sort keys %seen) {print "$name\n" if $seen{$name} > 1}}' names1.txt names2.txt

This splits on ", ", counts the number of times it sees each name, and then prints any that appear more than once. comm can also do the job if you can reformat the files to have one name per line instead of having them comma-separated. In that case you probably want "comm -12 names1.txt names2.txt" to print only the matching items.
russm
Thanks for the feedback, guys! Wayland: I tried your suggestion, but when I run (comm -1 -2 file1.txt file2.txt), nothing results. (comm -1 file1.txt file2.txt) shows me file 1's contents, (comm -2 file1.txt file2.txt) shows me file 2's contents, and (comm -3 file1.txt file2.txt) shows me both file 1's and file 2's contents. I can't seem to get the common terms for files 1 and 2 to show up! Christopherious, thanks for the info. Monday: I tried your suggestion, and nothing happens when I run it. Pardon my lack of terminal knowledge. When I just run cat 1.txt 2.txt, I get the contents of both files. I must be doing something wrong.
pantufla
Russm - didn't see your posting. Thanks, that code did work!
pantufla
Do you mean each line is name(s) separated by commas, or do you mean each list has one name per line? The second would be most common. comm and a lot of other text tools require the files to be sorted in order to work. Your examples work if they are indeed one name per line:

$ comm -1 -2 l1.txt l2.txt
Smith

But comm only works on 2 files. A more generic solution can be had many ways, but the general idea is the same: combine the files, sort them, count the number of duplicate lines, and print those.

$ sort l1.txt l2.txt | uniq -c | sort -n
   1 Adams
   1 Allen
   1 Hughes
   1 Jones
   2 Smith

This sorts the 2 lists together (could be any number of lists) and then counts the number of times each line is seen (then sorts the results of that numerically to make it easier to notice answers). Since 'Smith' is in both files, it has a count of 2. You can take this a bit further like so:

$ sort l1.txt l2.txt | uniq -c | awk '$1==2{print $2}'
Smith

which just prints the second column ($2, the name) where the first column ($1, the count) is 2. You could do this with N files and match a count of N (every file), or even use $1>2 to list names that occur in more than 2 files.

If your files really have each line like "Name1, Name2, Name3, ..., NameN", then I would split them up first, because it's way easier to work with one-name-per-line files:

$ perl -lne 'print join("\n",split(/,\s+/))' list_N_per_line.txt > list_1_per_line.txt

will split up the lines in the file to one name per line. Oh, and don't feel bad about comm, it's a real PITA to use sometimes.

# oblig Perl golf
$ perl -lane '$c{$_}++;BEGIN{$c=@ARGV}END{do{print if$c{$_}==$c}for%c}' l1.txt l2.txt
Smith
zengargoyle
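zengargoyle's count-and-filter idea generalizes to any number of lists; a sketch with three hypothetical files, matching the awk count against the number of files (3) so only names present in every list survive:

```shell
printf '%s\n' Adams Jones Smith > l1.txt
printf '%s\n' Allen Hughes Smith > l2.txt
printf '%s\n' Baker Smith > l3.txt

# uniq -c prefixes each name with its occurrence count; awk keeps
# the names whose count equals the number of input files (3 here).
# Assumes no name is repeated within a single list.
sort l1.txt l2.txt l3.txt | uniq -c | awk '$1==3{print $2}'
# prints: Smith
```

Changing the awk condition to $1>=2 would instead list names shared by at least two of the lists.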
no worries... as others have said there are nicer ways of doing it if the data is newline separated rather than comma separated - that's generally The Unix Way(tm)...
russm