I realized that a number of sites I've upgraded have this issue. Crap!
Since they all have weeks or more of new content, starting the upgrade anew is not an option for me. So, here's what I've done so far and what I plan to do. Does this make sense to you?
first, in the 4.6 db I created a new table from the nodes that have the issue
CREATE TABLE node_repair SELECT * FROM node WHERE body regexp concat(unhex('E280'),'\?');
then I fix the issue in that table
UPDATE node_repair SET body=replace(body, unhex('E280'), '"') WHERE 1=1;
then I dump the node_repair table and insert it into my 4.7 db
next, I'm going to insert the body of the repaired nodes into the current revision of those nodes
(don't have the query done yet, but it is going to join node and noderevisions on the current vid and select insert the body from the right nid in noderepair)
--Eric Goldhagen
That sounds about right - I did close to the same thing (before reading your blog!). I discovered though - there are some other binary characters that can mess things up. The php script I wrote takes care of all the ones I found (em-dashes and such. Also E280 is beginning of most of these - so some of the ones that start with E280 have additional characters - so it can create weird results to just replace E280 with a double quote.
--jamie
Ah, thanks.
Based on what you sent me, instead of running your script I did it all in the database.
first I created a table and populated it with data as you suggest in the initial post.
then I ran the following queries to fix the characters UPDATE nodeoriginal SET body=replace(body, unhex('E2809C'), '"') WHERE 1=1; UPDATE nodeoriginal SET body=replace(body, unhex('E28093'), '—') WHERE 1=1; UPDATE nodeoriginal SET body=replace(body, unhex('E28094'), '—') WHERE 1=1; UPDATE nodeoriginal SET body=replace(body, unhex("E28099"), "'") WHERE 1=1; UPDATE nodeoriginal SET body=replace(body, unhex("E28098"), "'") WHERE 1=1; UPDATE nodeoriginal SET body=replace(body, unhex('E2803F'), '"') WHERE 1=1;
After that, I ran the following query to take the fixed body content and put it into the body field of the latest revision of the effected nodes.
UPDATE node, nodeoriginal, noderevisions SET noderevisions.body = nodeoriginal.body WHERE noderevisions.nid = nodeoriginal.nid AND node.vid = node_revisions.vid;
I guess I also should run a query to populate the teaser for those nodes, which is not necessary in your script as it gets done when you call the nodeapi functions directly.
--Eric Goldhagen