A Grave Problem with HTML Entities.
-
Is there a Unix/Mac OS X utility that will batch-fix badly coded HTML entities? I'm trying to convert 7 years of HTML files that were hand-coded and hand-managed in Pagemill 3.0 for Windows. They're a mess, but that's okay for the most part. I'm converting them from HTML-esque to XHTML 4.0. They will eventually get redone in a custom XML schema. HTML Tidy is generally doing a bang-up job, but I'm having trouble with a bunch of files that have poorly implemented HTML entities. The problem is that Pagemill took characters from the Windows character set and created numeric entities out of them. HTML Tidy will convert them to Unicode equivalents, but it doesn't do it correctly, or I should say it doesn't recognize the problem. It will turn & #148; into & igrave; instead of & ldquo; (I put spaces in there to stop the posting process from converting them). These improperly coded entities tend to choke or confuse every other utility I've tried to throw at 'em (recode, html2text.py). Any ideas how to fix these things short of learning Perl overnight?
-
Answer:
I'd have to say that, given a map data structure of old->new entities, you could probably learn enough Perl to do it yourself in two lines more quickly than you'd find a utility that works for your specific case. HTML Tidy and friends are already pretty robust, so if they don't handle it, not much else will. My vote is for learning a skill and doing it yourself. :)
Charlie Bucket at Ask.Metafilter.Com
Other answers
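perl -pe 'BEGIN { %new = ("&#147;" => "&ldquo;", "&#148;" => "&rdquo;") } s{(&#\d+;)}{exists $new{$1} ? $new{$1} : $1}ge' file.html
Fill in the rest of the %new table yourself; references that aren't in it are left untouched. It really can be quite simple once you get ahold of some perl-fu.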
kcm
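For readers who would rather not learn Perl overnight, the same map-and-substitute idea can be sketched in Python. The table below is a hypothetical, partial one covering only a few Windows-1252 punctuation characters, and the names NEW and replace_entities are made up for illustration.

```python
import re

# Hypothetical, partial old -> new table; extend it from a Windows-1252 chart.
NEW = {
    "&#145;": "&lsquo;",
    "&#146;": "&rsquo;",
    "&#147;": "&ldquo;",
    "&#148;": "&rdquo;",
    "&#150;": "&ndash;",
    "&#151;": "&mdash;",
}

# Decimal numeric character references such as &#148;
ENTITY = re.compile(r"&#\d+;")

def replace_entities(text):
    # Look each reference up in the table; unknown ones pass through unchanged.
    return ENTITY.sub(lambda m: NEW.get(m.group(0), m.group(0)), text)

print(replace_entities("&#147;Hi&#148; &#150; &#233;"))
```

References that are not in the table, such as &#233; here, are deliberately left alone rather than deleted.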
Probably they confuse other utilities because they're not really valid. Code point 148 in the Windows-1252 character set is indeed a right double quote, and thus would become the 'rdquo' named entity, but the HTML spec says that the numeric reference for that character is 8221 (or x201D). Your best bet will probably be to get a table of the Windows-1252 code points between 128 and 159, which are the most commonly problematic ones (one such table is at http://www.i18nguy.com/markup/ncrs.html#t128159), and write your own script to translate them to their standard equivalents. <pedant>There is no such thing as XHTML 4.0.</pedant>
ubernostrum
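Instead of typing the 128 to 159 table in by hand, one option is to let a Windows-1252 codec do the lookup. The following is a minimal Python sketch of that translation, assuming the files only use decimal references; the function name fix_cp1252_refs is hypothetical. It rewrites each reference in that range to the numeric reference of the Unicode character Windows-1252 assigns to it.

```python
import re

# Decimal numeric character references such as &#148;
NCR = re.compile(r"&#(\d+);")

def fix_cp1252_refs(html):
    """Rewrite &#128;..&#159; to the code point Windows-1252 intended."""
    def repl(match):
        code = int(match.group(1))
        if not 128 <= code <= 159:
            return match.group(0)  # outside the problem range: leave alone
        try:
            char = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes in this range are unassigned in Windows-1252.
            return match.group(0)
        return f"&#{ord(char)};"
    return NCR.sub(repl, html)

print(fix_cp1252_refs("&#147;quoted&#148; &#233;"))
```

With this input, &#148; comes out as &#8221;, matching the 8221 value mentioned above, while &#233; is already valid and is left unchanged.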
If you have a list of them and if it isn't very long, it would be pretty easy to fix them using "sed" along with a shell script under UNIX. In more complex cases you could write a "lex" script, but that's black magic. (It's also extremely powerful. The first "Jive" translator, way back when, was a lex script.) Those are both standard UNIX utilities which have been part of every UNIX implementation I've ever used going all the way back to 1979, but I don't know if Apple included them in OSX. My guess would be "yes" for sed and "no" for lex, though, since lex is the front end for yacc, and yacc is obsolete. (Aren't UNIX names fun?)
Steven C. Den Beste
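The sed-plus-shell-script approach boils down to an ordered list of substitutions applied to every file. A rough Python sketch of the same idea follows; the two rules and the names RULES, apply_rules and fix_tree are illustrative assumptions, not anyone's actual script. Files are read and written as bytes via Latin-1 so that every byte and the original line endings survive the round trip.

```python
import re
from pathlib import Path

# Ordered (pattern, replacement) rules, applied top to bottom like sed commands.
# Both rules are illustrative assumptions; extend the list as needed.
RULES = [
    (re.compile(r"&#147;"), "&ldquo;"),
    (re.compile(r"&#148;"), "&rdquo;"),
]

def apply_rules(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

def fix_tree(root):
    """Rewrite every .html file under root in place; return the paths changed."""
    changed = []
    for path in sorted(Path(root).rglob("*.html")):
        # Latin-1 maps every byte to one character, so nothing is lost.
        original = path.read_bytes().decode("latin-1")
        fixed = apply_rules(original)
        if fixed != original:
            path.write_bytes(fixed.encode("latin-1"))
            changed.append(path)
    return changed
```

Calling fix_tree("site"), where "site" is a hypothetical directory name, would rewrite the files in place, so it is worth running on a copy first.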
@Steven C. Den Beste: On my Mac Mini running 10.4.6:
ellism-4:~ mgellis$ lex --version
lex version 2.5.4
The man page for lex goes to flex, so I don't know what is up there. yacc is also installed.
mge
Thanks for the help, guys, even though I posted after 5pm and all. I went with a sed script, something lightweight & easy to grasp, since I was going to use sed anyway to scrub out a lot of extraneous meta tags. When I'm done I'll run it through the sed -> perl translator to show me how else I might've done it. Sorry about the XHTML 4.0 biz, brain fart. Playing with GNU recode all afternoon must've gotten to me. FYI, no lex/yacc in OS X.
Charlie Bucket
For future reference: flex and bison are the GNU equivalents of lex and yacc, and are probably available for OS X.
andrew cooke
"FYI, no lex/yacc in OS X." FWIW, http://darwinports.org/ includes something listed as "byacc - Berkely yacc".
AmbroseChapel
Charlie, all 4 of them are on my MacBook Pro (running Mac OS X 10.4.6):
Kims-MBP:~ kgani$ which lex yacc flex bison
/usr/bin/lex
/usr/bin/yacc
/usr/bin/flex
/usr/bin/bison
You may have to install developer tools for them to be there, though.
KimG