How to compare and merge a large number of text files?

  • How can I automatically compare and merge a large number of text files? Due to a series of technical snafus with Dropbox and Simplenote syncing, my main writing folder, which contains text files mostly in Markdown format, is all messed up. I have about 500 unique files, but now each of them has multiple versions. For any given file, the directory contains something like this: textfile.txt, textfile.md, textfile.org, textfile.0001.txt, textfile.0002.txt. They mostly have identical content - some contain extra line breaks at the beginning, or a line containing the file name. I didn't realize immediately that this had happened and that I had multiple versions, though, so for some of them I've modified one of the versions and not the others. (The good news is that when I modify files, I don't edit or delete, I just add new text.) I want to reconcile my folder so that I have one canonical version of each file. Since there are now thousands of files, and more than two versions of each one, I'd rather not use a manual diff app to reconcile them. Is there a tool that will find multiple text files containing the same content and automatically merge them? Again, the files contain duplicated content plus some new content, so simply merging the duplicated content and then adding the new content at the end of the file would be satisfactory. (I'm using OS X 10.8.2 and I write primarily in Aquamacs Emacs. Oh, and I'm going to stop using Simplenote.)

  • Answer:

    Just wanted to chime in to say that whatever you do, make a backup copy of the entire folder *first*, so if something goes horribly wrong you can undo it.

incandescentman at Ask.Metafilter.Com
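
A minimal way to do that from Terminal before trying anything below; the source path here is hypothetical, so point it at your actual writing folder:

$ # copy the whole folder, preserving attributes, into a dated backup
$ cp -Rp ~/writing ~/writing-backup-$(date +%Y%m%d)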

Other answers

$ fdupes .
./a
./b
./c
./d
./with a space
# -f to omit the first duplicate file
$ fdupes -f .
./b
./c
./d
./with a space
# perl fu because spaces+shell == suckage
$ fdupes -f . | perl -lne 'unlink'
$ ls
a

`unlink` will silently fail on the blank line that fdupes places between groups of duplicate files. Probably should do `-f $_ && unlink` but meh.

zengargoyle

At a first pass I would gather checksums (probably md5) of everything to spot identical files I could get rid of.

$ md5 * | sort > /tmp/md5s

Looking in /tmp/md5s will then show clusters of identical files that will have identical checksums. Delete all but one file from each such cluster. Then, yes, I'd start a git or mercurial repository and try to use its merging mechanism to resolve the issues. The good news is that ASCII text is exactly what source code control software is designed to manipulate. The bad news is that unless you're a programmer, you probably won't have any idea what is going on. Personally, I prefer mercurial over git, as I find it easier to use and less argumentative, but it's mostly a religious issue.

chairface
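
A minimal sketch of the checksum-clustering step described above, assuming the BSD md5 that ships with OS X (which prints the hash last by default, so a plain `md5 * | sort` groups by filename rather than by checksum; -r reverses the output so sort clusters identical files). The /tmp path is just an example:

$ # -r puts the checksum first, so sorting groups byte-identical files together
$ md5 -r * | sort > /tmp/md5s
$ # checksums that appear on more than one line mark clusters of duplicates
$ awk '{print $1}' /tmp/md5s | uniq -d

Each checksum this prints can then be grepped back out of /tmp/md5s to see which files to thin down to a single copy.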

http://ask.metafilter.com/234876/How-to-compare-and-merge-a-large-number-of-text-files#3403377: "Also, I don't think there would be any satisfying solution involving genetic algorithms, as they involve randomly changing data and throwing away parts of it..." XMLicious, I didn't mean "genetic" as in "genetic algorithms". I was referring to the sort of problems that people working with actual DNA have. For example, you might have several similar DNA sequences and want to figure out how they may be related via mutation, or you might have several segments of DNA which you know fit together, so you have to look for overlaps at the edges. Getting back on topic, I don't think a three-way merge will help the OP much. "Three-way merge" specifically refers to the case where you're comparing an ancestor A to two descendants B and C. If the files you pick don't have that relationship, it may be less useful.

vasi
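
For concreteness, the ancestor-plus-two-descendants comparison described above can be tried directly with diff3 (from GNU diffutils); the filenames below are hypothetical, and as vasi notes, this only helps if one version really is the common ancestor of the other two:

$ # argument order is MINE OLDER YOURS, so textfile.txt is treated as the ancestor A
$ # -m prints the merged result, with conflict markers where both descendants changed the same lines
$ diff3 -m textfile.0001.txt textfile.txt textfile.0002.txt > textfile.merged.txt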

fdupes will handle that for you.

$ fdupes -h
...
 -d --delete      prompt user for files to preserve and delete all others;
                  important: under particular circumstances, data may be lost
                  when using this option together with -s or --symlinks, or
                  when specifying a particular directory more than once;
                  refer to the fdupes documentation for additional information
 -N --noprompt    together with --delete, preserve the first file in each set
                  of duplicates and delete the rest without prompting the user

`fdupes -rNd [path]` -- keep one of each duplicate, delete the rest without asking.

zengargoyle

In the end I wound up using a hybrid method. After eliminating all the duplicates I could using the methods provided by genius mefites, I am now using the following method to try to get the rest: searching for files with similar filenames, cat'ing them together, then eliminating duplicate lines to remove the overlap. Tedious, but it seems to be working. Thanks again.

incandescentman
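
A minimal version of that cat-and-deduplicate step; the brainstorming* names are hypothetical (they echo the example elsewhere in this thread), and the awk filter keeps the first occurrence of each line while preserving order, unlike sort -u:

$ # concatenate every version of one note, then drop lines that have already been seen
$ cat brainstorming*.txt brainstorming*.md | awk '!seen[$0]++' > brainstorming.merged

One caveat: this also collapses repeated blank lines down to one, so paragraph spacing may need touching up afterwards.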

Wow, what an amazing community you guys are. ask.metafilter.com, I love you. Thank you all so much for lending your minds to this task, and zengargoyle, zug, vasi, thank you for donating your brains and time to helping me sort this out. I really appreciate it.

incandescentman

Just stumbled across http://peterlane.info/ferret.html while I was looking for something else: "Ferret is a copy-detection tool, created at the University of Hertfordshire by members of the group at http://homepages.stca.herts.ac.uk/~pdgroup. Ferret locates duplicate text or code in multiple text documents or source files. The program is designed to detect copying (collusion) within a given set of files. Ferret works equally well with documents in natural language (such as English, German, etc.) and with source-code files in a wide range of programming languages." It appears to be targeted at Linux, though.

XMLicious

Thank you all for your thoughts so far. I'm going to circumscribe the problem by doing more of this manually and seeking to automate a smaller portion of it. I've looked more closely at the files, and I've detected some patterns. Here's an example of 5 versions I would want to reconcile into one canonical version. I have the following 5 files: brainstorming.0001.txt, brainstorming.0002.txt, brainstorming.txt, brainstorming.md, and # brainstorming.txt. These files are almost identical. For each one, I'll list the filename and then the first lines of text in the file. As you will see, each of the files repeats its filename as its first line. Some of the files repeat it twice.

# brainstorming
who's the audience?
what are the needs

brainstorming
# brainstorming
who's the audience?

brainstorming
# brainstorming
who's the audience?

brainstorming
# brainstorming
who's the audience?

brainstorming
# brainstorming
who's the audience?

...

So. If I could find a way to simply find all such variations, realize they're all the same file, and then consolidate them, I would be satisfied, and I could do the rest manually.

incandescentman
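
One way to spot those clusters automatically is to normalize away the .0001/.0002-style counters and the extensions and then look for base names that occur more than once. This is only a sketch of that grouping step, assuming the .txt/.md/.org extensions mentioned above:

$ # strip ".NNNN.ext" and ".ext" suffixes, then report base names shared by several files
$ ls | sed -E 's/\.[0-9]+\.(txt|md|org)$//; s/\.(txt|md|org)$//' | sort | uniq -d

Each name this prints corresponds to a group of files that can then be consolidated, for example with the cat-and-deduplicate step above.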

Okay, here's a unix one-liner that will find duplicate files and delete all but the first one (as I said before, you'd better have a backup before trying this):

$ fdupes -rf1 /path/to/directory | xargs rm

You'll need to install fdupes first (port install fdupes).

zug
