
Using grep to extract x characters of text after a predictable pattern

  • I'm trying to work out how to use something like grep or sed or awk (or maybe even some Perl) to extract a string of characters which appears in a predictable place in a series of text files. Help/advice/tutorials very welcome! I have a number of text files in a Google Drive folder, which gets replicated on my Mac, and a new one gets added every day. The files are plain text and they aren't structured formally (as in, they're not XML or anything) but they each contain a certain string ("Total Portfolio Value") which is always followed by a space, followed by "$xx,xxx.xx", followed by five more spaces. The value of $xx,xxx.xx changes every day, and that's what I'm trying to extract, to put into a separate file. I can use Automator to check whenever a new file appears and run a shell script on the file, so I'm trying to work out what goes in the shell script. As much as anything else I'm using this as a practical exercise to teach myself a bit about text processing using grep/sed/awk, Perl and regular expressions (any/all of the above!) so just a few pointers about the best approach to the contents of the shell script would be great!

  • Answer:

    And perhaps you then want to extract just the price, which you can do as egrep -o 'Total Portfolio Value \$[^ ]*' /tmp/foo | cut -d$ -f2 | tr -d ,. (Because | means "send the output of the previous command to be the input of the subsequent one", and cut -d$ -f2 means "split the input at every $, and extract the second field", and tr -d , means "delete all the commas".)

infinitejones at Ask.Metafilter.Com
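
For anyone wiring the pipeline above into a script, here is a minimal sketch. It assumes the report layout described in the question; the function name extract_value, the sample text and the temporary file names are all made up for illustration. grep -E is the POSIX spelling of egrep.

```shell
#!/bin/sh
# Sketch only: extract_value, the sample text and the file names are hypothetical.

# Print the amount that follows "Total Portfolio Value $", minus "$" and commas.
extract_value() {
    grep -oE 'Total Portfolio Value \$[^ ]*' "$1" | cut -d'$' -f2 | tr -d ,
}

# Stand-in for one day's report file.
report=$(mktemp "${TMPDIR:-/tmp}/report.XXXXXX")
printf 'Daily summary\nTotal Portfolio Value $12,345.67     as of close\n' > "$report"

# Stand-in for the separate file that collects one value per day.
values=$(mktemp "${TMPDIR:-/tmp}/values.XXXXXX")
extract_value "$report" >> "$values"
cat "$values"    # prints 12345.67

rm -f "$report" "$values"
```

In a real script the report path would presumably come from whatever launches the script, and the values file would live at a fixed path rather than being a temporary file.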


Other answers

Here's a solution with egrep and sed that searches for 'Total Portfolio Value', works if there are other numbers in the file (empath's solution returns all the $x,x.x in a file, which may not be what you want), and throws away the commas and the dollar sign.

    $ cat test.txt
    Total Portfolio Value $11,1234.56 more text $99,999.99
    Extra line
    $ cat test.txt | egrep -o 'Total Portfolio Value \$[^ ]*' | sed -e 's/^Total\ Portfolio\ Value\ \$//' | sed -e 's/,//'
    111234.56

caek
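
One thing to watch in the sed approach: s/,// without the g flag removes only the first comma on each line. That is enough for $xx,xxx.xx, but a value of a million or more would keep its second comma. A possible single-sed variant is sketched below; value_with_sed and the sample text are made up for illustration.

```shell
#!/bin/sh
# Sketch only: value_with_sed and the sample text are hypothetical.

value_with_sed() {
    # -n plus the trailing p prints only lines where the substitution succeeded.
    # \( \) captures the run of non-space characters after the dollar sign.
    # tr -d , then deletes every comma, not just the first one.
    sed -n 's/.*Total Portfolio Value \$\([^ ]*\).*/\1/p' "$1" | tr -d ,
}

sample=$(mktemp "${TMPDIR:-/tmp}/sample.XXXXXX")
printf 'Total Portfolio Value $1,234,567.89     more text $99,999.99\nExtra line\n' > "$sample"
value_with_sed "$sample"    # prints 1234567.89
rm -f "$sample"
```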

grep -o '\$[0-9]*,[0-9]*\.[0-9]*'

empath

That's kind of a quick and dirty way of doing it; it'll basically match $*,*.* where * is any number of digits.

empath

Or similarly egrep -o 'Total Portfolio Value \$[^ ]*' /your/file, which will match "Total Portfolio Value " and then everything up to the next space. (Because [^ ] means "anything that isn't a space", so [^ ]* means "as many characters as possible as long as they're not a space", and egrep -o means "search for the following and only return the result of the match".)

katrielalex

In some of these examples, the [^ ]* should really be [^ ]+, i.e. require there to be at least one non-space character. * matches any number including 0, so some of the examples above would match a line ending in "Total Portfolio Value $". Changing it to [^[:blank:]]+ would be more robust (it catches tabs too) but wouldn't guarantee that what follows the '$' sign is a number. [[:digit:],]+ is better still but would match an arbitrary number of commas with no digits.

epo
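
Taking the tightening above one step further, the pattern can spell out the shape of the amount itself. Note also that [[:digit:],]+ on its own stops at the decimal point. The sketch below assumes amounts always look like 1 to 3 digits, optional groups of ",ddd", then ".dd"; strict_value and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Sketch only: strict_value and the sample lines are hypothetical.

strict_value() {
    # [0-9]{1,3}      one to three leading digits
    # (,[0-9]{3})*    zero or more groups of a comma plus exactly three digits
    # \.[0-9]{2}      a decimal point and two digits
    grep -oE 'Total Portfolio Value \$[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}' "$1" |
        sed 's/.*\$//' | tr -d ,
}

sample=$(mktemp "${TMPDIR:-/tmp}/sample.XXXXXX")
printf '%s\n' \
    'Total Portfolio Value $' \
    'Total Portfolio Value $,,,' \
    'Total Portfolio Value $12,345.67     end' > "$sample"
strict_value "$sample"    # prints 12345.67 only
rm -f "$sample"
```

The trade-off is that a malformed amount such as the $11,1234.56 in caek's test file is rejected outright rather than passed through.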

These all look brilliant. Lots of detail for me to get my head around, which is what I was hoping for! Thanks very much everyone.

infinitejones

None of the answers offered actually match your specification. This will print the number found after "Total Portfolio Value" and a space and a dollar sign, with five spaces after it: perl -lne 'print $1 if /Total Portfolio Value \$([\d.,]+ {5})/' input_file.txt

nicwolff
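
As written, the capture group in the Perl one-liner above includes the five spaces, so they are printed along with the number. If only the number is wanted while still requiring the five spaces, one possible sed equivalent is sketched below. spec_value and the sample lines are made up for illustration, and commas are left in place, as in the Perl version.

```shell
#!/bin/sh
# Sketch only: spec_value and the sample lines are hypothetical.

spec_value() {
    # -E enables extended regular expressions.
    # " {5}" requires five spaces after the number, but it sits outside
    # the capture group, so the spaces are matched without being printed.
    sed -nE 's/.*Total Portfolio Value \$([0-9.,]+) {5}.*/\1/p' "$1"
}

sample=$(mktemp "${TMPDIR:-/tmp}/sample.XXXXXX")
# %5s and %2s with an empty argument expand to five and two spaces.
printf 'Total Portfolio Value $12,345.67%5sfollowing text\n' '' > "$sample"
printf 'Total Portfolio Value $99,999.99%2stoo few spaces\n' '' >> "$sample"
spec_value "$sample"    # prints 12,345.67 only
rm -f "$sample"
```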
