We upgrade a Drupal site from 4.6 to 4.7 and to my pleasant surprise everything went smoothly... until 2 weeks later the administrator of the site noticed that several blog entries were truncated at seemingly random places. He sent me 10 examples of blogs that were anywhere from 10 to 90% of their original length, all of them cut-off in the middle of sentences. Furthermore, the content was missing in the MySQL database - it was not just a problem with php or the browser not rendering the content properly.

After carefully looking at the entries I noticed that all of them had "smart quotes" in them. When I compared the truncated blog entries with the original entries (from the 4.6 backup) I noticed that they were all cut-off just where the closing smart quote should be.

Next, I tried to search the 4.6 backup for all nodes that have smart quotes. This was no easy feat because the smart quotes are represented with weird binary characters. After some struggle and googling, I discovered that using the hex function was the easiest way to do it.

One of the affected entries had

emergencies.[binary character]? The

Then I did:

SELECT HEX('emergenices.'); SELECT HEX('? The');

And finally:

SELECT HEX(body) FROM node WHERE nid=350;

I then searched for the output of the first two select statements and discovered that the hex version of the closing smart quote is:

E280

Then:

SELECT nid FROM node WHERE body regexp UNHEX('E280')

And that returned all the articles, including the ones that were not truncated.

More examination by hand revealed that the problem only happend with closing smart quotes followed by a question mark. The hex value for a question mark is:

3F

Howevever:

SELECT nid FROM node WHERE body regexp UNHEX('E2803F')

Doesn't work because the question mark is a special regular expression character. Finally I got this to work:

SELECT nid FROM node WHERE body regexp concat(unhex('E280'),'\?');

About 10% of all the nodes were affected.

Next I created a table to only store the affected data:

CREATE TABLE node_original ( nid int(10) unsigned NOT NULL, body longtext NOT NULL);

And then I selected the affect nodes into that table:

INSERT INTO node_original (nid,body) SELECT nid,body FROM node WHERE body REGEXP CONCAT(unhex('E280'),'\?');

Next step: I wrote a php script designed to be written from the command line. It bootstraps into the Drupal installation, pulls the data from the node_original table and updates the node in the drupal site (making a revision in case something bad happens).

It's attached to this blog in the event that someone might find it useful. NOTE: please don't run it without fully reading it! It could do very bad things.