Last week I told a client he could send me a file in excel and I could import that into the mysql database directly. When I got the file I learned that it wasn’t quite that easy. I had planned to just save it as tab delimited text or CSV, but there was a “description” field which was copied from various webpages (by copying the rendered text in the browser window).
Within the description field there were trademark symbols, copyright symbols, slanted quotes, triple dots, non-standard dash marks, bullets, and even a cross symbol (or dagger) used for footnotes. In addition, there were several paragraphs (including multiple line breaks) in each description field.
The first challenge to overcome was getting the database to handle line breaks properly so I could import the data. I figured if I added a special character before each new record then I could cause the database to ignore the newline characters, but when I tried pipe (|) and three pipes (|||) neither of those worked (I only got one record imported each time). It finally worked for me when I added an equal sign (=) before each new record (after checking to make sure there were no equal signs in the text – if there had been it should have worked by escaping them with a backslash and telling mysql that is the escape character).
In order to insert the equal signs I first saved the excel sheet as tab delimited text, then opened it in notepad++ and used their regular expression search and replace tool to do the insertions quickly. Since I had numbers in the first field of my data, I searched for ^(\d) and replaced it with =\1. If I wanted to spend more time on it I could have made a perl script to go straight from the .txt file to sql statements, but I already had a php file to create sql statements so I wanted to use that.
Once I had the equal signs in place, I used a load data infile command (actually created by phpmyadmin, I just had to make the selections – the local keyword was necessary for permission requirements) like this:
LOAD DATA LOCAL INFILE 'path-to-text-file-here' INTO TABLE 'myTableName' FIELDS TERMINATED BY '\t' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '=';
I was happy to get the lines imported, but the line breaks were lost and special characters showed up as question marks on the webpage where I displayed the data. So then I went back to my php script used for going from access to mysql from a few years ago and made a few modifications. I first tried using the php htmlentities() command to convert the symbols back to html code but for some reason that didn’t work for me. So through an iterative process I added replace commands for each symbol I found. I wasn’t trying to create all the symbols but to preserve the essence of the text. If you want to convert all the symbols you find you can look at a w2schools reference symbol page for a list – there is also an entity list at w3schools, if you don’t find your symbol in the first list. I also wanted to preserve the newlines so I used the php command nl2br and that worked great. Here is the final code I added for processing each field:
$data_value = trim($data_value);
$data_value = trim($data_value,'"');
if (empty($data_value)) continue;
$data_value = str_replace('™','&amp;trade;',$data_value);
$data_value = str_replace('®','&amp;reg;',$data_value);
$data_value = str_replace('’','\'',$data_value);
$data_value = str_replace('…','...',$data_value);
$data_value = str_replace('•','-',$data_value);
$data_value = str_replace('–','-',$data_value);
$data_value = str_replace('†','*',$data_value);
$data_value = str_replace('“','"',$data_value);
$data_value = str_replace('”','"',$data_value);
$data_value = str_replace('—','-',$data_value);
$data_value = nl2br($data_value);
I also couldn’t just copy the sql queries from the webpage because all the html code was converted back to symbols, so I added this line to write the queries directly to a local file:
Then I could open the queries.txt file and have a bunch of sql queries which I could run to populate my database table in mysql. My updated text to sql php script can be downloaded and used freely.
Update: I put the time into creating a perl script to make this easier, so take a look at that if the above process is too much work for you. 🙂