How to merge multiple CSV files into a single CSV file?

Script to manipulate .csv database files

  • I'm looking for a script that will process a CSV file. I've looked into writing one myself in python, but think its probably (definetely?!) beyond me. The script would need to be able to be made executable in windows or OS X, so that I could launch it, enter the location of the .csv input file at a prompt, and let it process the data. The script would need to work as follows: 1 - Sort all records in ascending order (i.e. a,b,c...) by the sixth column (the sixth entry on each line). Obviously the first line of the csv file would contain the names of the columns for the subsequent records, so this line would not be sorted. 2 - Now discard all the records that contain anything other than what is now in the 6th column in the first record. So for example: Line 1 A Line 2 A Line 3 A Line 4 B Line 5 C - Lines 4 and 5 would be removed from the file. (Just want to make sure I'm explaining myself properly!) 3 - The records would then need to be sorted according to a set of 3 priorities. Firstly - sort the records by the 4th column, again in ascending order, as the highest priority. The next priority would be to sort the 18th column in DESCENDING order. And the final priority would be to sort the 6th column in ASCENDING order. So after this columns 4, 18 and 6 would look like this: line 1 A Z A line 2 A Z B line 3 B A A line 4 D A B 4 - The final part of the script would be to seperate the records according to what is contained in column 4, and output all records containing the same values in column 4 to seperate files in a new folder. Records would be stripped of all columns after number 7, and a new field with a custom value would be appended on straight afterwards to make 8 fields in the outputted files. A custom first line would also be appended to each file. Ideally the files generated would take their name from the value that their records held in column 4. Explaining that was more difficult than I thought it would be! I've not much idea of the difficulty of writing such a script, so if it goes beyond the remit of google answers then just let me know, or if the price is wildly unrealistic then please also let me know. Like I say i've little idea how difficult this would be to pull off. I've any questions need to be asked or more information supplied, just let me know. Thankyou!

  • Answer:

    Dear superunknown, I have labored over the past five hours to write a Python script according to your specifications, and hereby present to you a first draft. I call it a first draft because it is unlikely that any program of this length, no matter how skillful the programmer, is free of bugs. I would be surprised if this one were. I have tested it to some extent on my own machine with a toy example, but I do not have access to your particular files. Bugs and incompatibility problems are likely to arise, as I detail below. Bear in mind, however, that I am committed to stamping out every single bug and modifying any feature until the program suits your needs precisely. My work on this question is not done until you are completely satisfied. In order to try out the script, you will first have to examine and modify the six constants defined near the beginning. Two of these are the custom field value and custom header that you requested, while the remaining four govern the behavior of the script to make it more adaptable to different circumstances. If the meaning of any of these constants is unclear, just ask. Once you have adjusted them to your liking, execute the script from a command line by typing python.exe grok.py on a Windows box or python grok.py on Linux, and see what happens. If the script crashes with a nonsensical error message, report back to me with a text dump so that I can get to work immediately on fixing the problem. If the error is reported by my own script -- you'll be able to tell from the relatively friendly message beginning with "error:" -- then it's a problem I've anticipated, and again I'll want to see a dump. Your choice of operating system may have a number of characteristics that I was unable to predict in advance. Among these unpredictable characteristics is the required reading and writing mode for .csv files. I can read and write them in binary mode on my Linux box, but your system may require text mode. A somewhat more significant problem is that of the format to which your particular CSV files adhere, since there is no single accepted CSV standard. The script at present assumes that it is reading a file in Excel's CSV format. If this is not the case, you will have to post a sample file so that I can analyze the formatting differences and implement them accordingly. Yet another difficulty is with the comparison operators, since I don't know what kinds of values you have in your fields. At present, the script properly sorts numbers and strings. If you have other kinds of data in your fields, again I'll have to see a sample file for analysis. I'm also not quite sure I've correctly interpreted your requirements for "custom value" and "custom first line", but I will work with you to adapt this and any other program feature to meet your needs. In addition, I would like to point out a few suboptimal aspects of the algorithm you specified, which I nonetheless implemented exactly to order. In particular, there is no need to sort the records by their sixth field if you merely want to find out the lowest value present in the sixth field of any record. This could be done by a traversal of the records in their existing order. Then you ask that records with any other value in the sixth field be thrown out, but later, you specify that the records be sorted according to values in the sixth field, yet the effect of the filtering step is that these values are all the same. Please don't think that I am criticizing your choice of algorithm. There may be details I don't understand, and perhaps you intend to modify the algorithm later so that these steps are necessary after all. I only wanted to bring these slight inefficiencies to your attention in case you weren't aware of them. In any event, I have implemented them according to spec. The script follows. I have included some inline comments so that it won't be entirely incomprehensible to those who have done some programming. #!/usr/bin/python # ----- begin grok.py # ***** adjust constants as necessary CUSTOM_FIELD_VALUE = 'change_me' # this value is added as the eighth field of each record # remember, fields beyond seventh are previously dropped CUSTOM_BEGINNING = 'change_me_as_well' # each result file begins with this text # careful: the script adds a newline character DISCARD_FIRST_LINE = 1 # if non-zero, first line of CSV file is thrown out # if zero, first line of CSV file is preserved RESULT_FOLDER_NAME = 'test_results' # name of folder where results are to be stored OVERWRITE_RESULTS = 0 # if non-zero, existing result folder is used # if zero, new numbered result folder is made PUT_RESULTS_THERE = 1 # if non-zero, results are stored next to CSV file # if zero, results are stored at execution point # ***** end of constants import sys, os, string, re import csv def cmp_six(a, b): return cmp(a[5], b[5]) def cmp_complex(a, b): v = cmp(a[3], b[3]) if v != 0: return v v = cmp(b[17], a[17]) if v != 0: return v return cmp(a[5], b[5]) print 'starting grok.py' while 1: # prompt user for file name sys.stdout.write('enter a file name, or nothing to quit >') fname = sys.stdin.readline().strip() if fname == '': print '\nexiting grok.py' break # check validity of file name if not os.path.isfile(fname): print 'debug:', sys.exc_info() print 'error: "%s" is not a valid file name' % fname continue # open file and begin reading records = [] print 'reading from file "%s"' % fname try: reader = csv.reader(open(fname, 'rb'), dialect='excel') line_num = 0 # discard header if required if DISCARD_FIRST_LINE: print 'discarding header %s' % reader.next() line_num += 1 # parse all lines for line_list in reader: records.append(line_list) line_num += 1 except csv.Error: print 'debug:', sys.exc_info() print 'error: csv parser failed on line %d' % line_num continue print 'read %d lines' % line_num # attempt to sort records by sixth field try: records.sort(cmp_six) except: print 'debug:', sys.exc_info() print 'error: failed to sort records by sixth field' continue # filter out records where sixth field does not have the highest value master_value = records[0][5] new_records = [] for record in records: if record[5] == master_value: new_records.append(record) records = new_records # attempt to sort: 4th field ascending, 18th descending, 6th descending try: records.sort(cmp_complex) except: print 'debug:', sys.exc_info() print 'error: failed to sort records by fields 4, 18, 6' continue # attempt to locate or make result folder write_dir = os.getcwd() if PUT_RESULTS_THERE: write_dir = os.path.dirname(fname) result_dir = os.path.join(write_dir, RESULT_FOLDER_NAME) if os.path.exists(result_dir): if OVERWRITE_RESULTS: print 'reusing result folder "%s"' % result_dir else: i = 1 while 1: new_result_dir = '%s.%05d' % (result_dir, i) if not os.path.exists(new_result_dir): result_dir = new_result_dir print 'making result folder "%s"' % result_dir os.makedirs(result_dir) break i += 1 if i == 10**5: break if i == 10**5: print 'error: too many result folders already exist' continue else: print 'making result folder "%s"' % result_dir os.makedirs(result_dir) # separate records according to the fourth field, stripping fields past 7th h = {} record = None try: for record in records: val = record[3] if h.has_key(val): h[val].append(record[:7]) else: h[val] = [record[:7]] except: print 'debug:', sys.exc_info() print 'error: record separation failed on >%s<' % record continue # write out records to files, prepending custom line and appending 8th field try: keys = h.keys() keys.sort() for key in keys: res_fname = os.path.join(result_dir, 'result_%s.csv' % key) print 'making result file "%s"' % res_fname outf = open(res_fname, 'wb') outf.write(CUSTOM_BEGINNING+'\n') writer = csv.writer(outf, dialect='excel') line_num = 1 for row in h[key]: line_num += 1 writer.writerow(row + [CUSTOM_FIELD_VALUE]) outf.close() except: print 'debug:', sys.exc_info() print 'error: failed to write line %d of "%s"' % (line_num, res_fname) continue print 'completed processing "%s"' % fname # ----- end grok.py Let me reiterate that I do expect some bugs and/or incompatibilities to arise, and I understand that you may not be enamored with some of the features. But don't forget that I'm here to serve you, and that I won't rest until everything is perfect. Report any and all problems to me with Clarification Requests. Regards, leapinglizard

superunknown-ga at Google Answers Visit the source

Was this solution helpful to you?

Just Added Q & A:

Find solution

For every problem there is a solution! Proved by Solucija.

  • Got an issue and looking for advice?

  • Ask Solucija to search every corner of the Web for help.

  • Get workable solutions and helpful tips in a moment.

Just ask Solucija about an issue you face and immediately get a list of ready solutions, answers and tips from other Internet users. We always provide the most suitable and complete answer to your question at the top, along with a few good alternatives below.