MySQL: “SOUNDS LIKE” vs. Full-Text search

Dec 14, 2007 / By Sheeri Cabral

Tags:

A friend of mine asked me:

I’m hoping you can help me out with something — I’m trying to optimize a search feature. Since it uses a MySQL database, the search already uses the LIKE statement to get matches for a search query, we might be needing something more flexible. I found mention on MySQL’s website about something called the SOUNDS LIKE expression that can be more flexible than LIKE. Do you know anything about this? If you do, can you point me a direction where I might be able to learn more about it? Thanks in advance for your help!

My response:

I haven’t used SOUNDS LIKE, but the MySQL manual’s pretty good. So here are my thoughts on what it has to say.

The manual’s sub-section on string functions says:

expr1 SOUNDS LIKE expr2

This is the same as SOUNDEX(expr1) = SOUNDEX(expr2).

So let’s say your example is searching for “Sheeri”. You’d do WHERE field SOUNDS LIKE "%Sheeri%";

And maybe you’re hoping to get fields that contain “Sheeri” and “cheery”. However, what this will do is WHERE SOUNDEX(field) = SOUNDEX('%Sheeri%');

There is some important information in the manual’s description of the soundex function, which says that you can use SUBSTRING to get a part of a string. But you want to search on some part of a field, and you don’t know where in that field the string might be.

One of the biggest problems with a search feature using LIKE is that inevitably you use search terms such as WHERE field LIKE "%Sheeri%";

MySQL can use an index on a text field, but the internal format of the index uses the first character, then the second character, and so on, just as you search for a word in a dictionary. First you get to the section that has the first letter, then flip pages to get to the second letter, etc.

However, just as it would be impossible for you to search a dictionary for all words ending in “th”, MySQL cannot use an index on a string/text field if you have a wildcard at the beginning of your comparison. MySQL can use an index if you search

WHERE field like "Sheeri%";
and
WHERE field like "Sh%ri%";
but not
WHERE field like "%Sheeri";

I think what you really want is a full-text search. Full-text search can only be done against MyISAM tables. With full-text search, you can sort rows in order of relevance.

It won’t address the issue where someone spells something wrong in a search, but the WITH QUERY EXPANSION mode of search can help that. The manual’s sub-section on Full-Text Searches with Query Expansion says it best:

It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search. Thus, if one of these documents contains the word “databases” and the word “MySQL”, the second search finds the documents that contain the word “MySQL” even if they do not contain the word “database”.

Make sure you read all the gotchas in the Full-Text Restrictions chapter.

I hope this helps!

9 Responses to “MySQL: “SOUNDS LIKE” vs. Full-Text search”

  • gigiduru says:

    How about telling your friend to quit using this mysql toy and install something that supports natively fulltext search? Something like PostgreSQL 8.3, which is around the corner.
    Remember, friends don’t let friends use MySQL.

  • Paul Vallee says:

    Umm, if they have already adopted MySQL I’m pretty sure that’s incredibly bad advice you’re providing there. Do you have any idea how much money is involved in a database migration project, all costed out? Methinks I smell a rookie.

  • gigiduru says:

    Me thinks you’re wrong about me being a rookie but keep smelling, who knows what you get into.
    Also me thinks it’s bad to adopt a toy to do man’s job. Migrating from a db to another is just a matter of time: the sooner the better, especially when you have to migrate from a toy to a real open source RDBMS, that’s also much more robust and reliable.
    I better shoot myself than use ever again MyISAM. At the very lowest level, if I’m forced to use MySQL, I’ll use InnoDB, but that, as you probably noticed, defeats the purpose of Sheeri’s post.
    Good luck!

  • Paul Vallee says:

    I think an interesting discussion can be salvaged from this. Let’s try.

    Let’s assume for a minute that the full-text feature can be taken out of the existing database and the db migration objection can be cast aside.

    You should know I’m no apologist for MySQL or even for Oracle for that matter. I’m all about the data, not the database platform. So…

    I’m interested in arguments justifying building that full-text feature in the database at all, given the existence of the Google appliance. Assuming postgresql’s full-text features are so kickin’ compared to mysql’s, talk to us about that I’m eager to learn. Finally, if we’re gonna spend some money and it needs to be in a database, why wouldn’t we just spend $5000 on Oracle SE1 on a single quad-core CPU loaded with gads of RAM?

    Paul

  • gigiduru says:

    If you have that kind of money, I should ask you the same question: why don’t you migrate to Oracle? And in this way you solved all your problems with glitches that an open source product might have, plus you have technical support 24×7. It’s simple: you pay a boat load of money on a solid reliable product and you’ll rest assure that your RDBMS will run smoothly.
    If you don’t have that kind of money, but you still wanna migrate to a different, I’d choose PostgreSQL for two reasons:
    1. I know very well what MySQL can do and CAN’T do. The “CAN’T do” part is astonishing big, in total contrast with what MySQL AB touts on their website. “The right tool for the right job” principle is good in theory.

    2. You cannot be RDBMS agnostic, pick whatever falls in your hands and consider fit for the job. Not after I have to run REPAIR TABLE on a 700+ mil rows MyISAM table, after some insert delay glitches.
    It’s simply not professional to recommend something like this to your clients, knowing that might loose data and have downtimes.

    2. Just look at http://www.postgresql.org/docs/8.3/static/release-8-3.html it’s compelling enough to make da move. Until it’s out … hang on, your Saviour is on the way. It’ll be a steep learning curve but I’m sure you’ll make it.

    By the way, my day to day work is MySQL DBA. Hopefully, it’ll not be MySQL for a long time.
    Heck, even SQLite is better than MySQL for the simple reason that it does what it states. Meeting expectations is everything.

  • acasalamata says:

    You sure block peoples’ comments.

  • Paul Vallee says:

    > You sure block peoples’ comments.

    No we don’t – if you posted and it did not show up here, let me know and we can work together to figure it out. We only moderate for spam.

    Paul

  • Paul Vallee says:

    acasalamata: I sent an email to the address you submitted when posting your comment, unfortunately it bounced. Here’s what it said:

    Hey there,

    If your comments are being dropped from our blog, rest assured I’m interested in that not happening. Please let me know if you’re willing to work with my team to get to the bottom of the issue.

    Thanks

    Paul

    Paul Vallee, CEO, The Pythian Group, Inc.
    http://www.linkedin.com/in/paulvallee

    My email is vallee@pythian.com, reach out and we’ll figure out why your comments are being lost.

  • Sheeri Cabral says:

    Paul — great discussion.

    Putting fulltext search outside the database is ideal for the reasons you stated.

    However, it’s also not ideal for 3 reasons (people will have to weigh all the pros and cons for what’s good for their environment, as there’s no clear-cut “always” and “never” here…..):

    0) The features of a [relational] database are taken away — most importantly, the ability to relate the text to other entities. You also take away sorting, but that’s easily fixable in OS’s, and perhaps most fulltext engines, such as Lucene*. A Google appliance is quite costly for just plain ol’ text searching. There are also companies out there that sell the service of searching your data quickly — Transparencee (http://www.transparensee.com/)* is one I’ve heard of.

    This is very important, because it is an important point I make when talking about whether or not to put images in a database (see also http://sheeri.net/archives/39 ). One of the factors is easily being able to relate the data to each other — otherwise, you’re using a database to store information, not actually relate data to each other. If you want to just store data, you can put it all in flat files!

    Being able to easily look up text is important, but so is being able to update and delete text that has some certain metadata associated with it — maybe that’s a user id, or a timestamp, etc. And honestly, it’s the same with images. Yes, there are costs to storing large BLOBS in the database, but that’s a factor, not an argument, because using a database simply for storage is a silly idea anyway.

    1) In some cases you want to perform functions on the text in part or in whole. For example, encryption….

    2) The data is in more than one place. This is similar to point 0 in that if you have all the data in the database, it’s all in one place. But it warrants having its own bullet point, because it’s more difficult to get a reliable point-in-time snapshots of the data for backup and recovery. You can’t replicate all the data, either.

    I think it would be great if DBMS’ had API hooks, so you could choose whether or not to use the way they implement fulltext searching or not. That would add overhead, as the data and the search engine would have to communicate as opposed to the application going straight to the searching engine when it needs to. However, it would also make searching a black box, and not require application changes to use.

    SQL Server already uses an external solution:

    http://www.developer.com/db/article.php/3446891

    It uses Microsoft’s search to look for data. However, this also means that it doesn’t automatically re-index when a change is made. (See the “Indexing Considerations” part at http://www.sitepoint.com/blogs/2006/11/12/sql-server-full-text-search-protips-part-1-setup/ )

    And I know MySQL is working on a FULLTEXT solution that’s server-wide. Maybe they’ll include API hooks so we can drop in our own solution.

    * note that Lucene and Transparencee aren’t products or companies Pythian/I endorse nor discourage — those are simply examples I’ve heard of, but never used myself.

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>