Friday, June 20, 2008, 11:29 AM Thoughts by John (Article #217)
I was reading Talking Points Memo today (yeah, I am a Democrat, and have been since I turned 18 in 1996, which makes me better than you, because I was a Dem when it wasn't cool) and they had an article that jumped right out at me as a tiny software not-quite-glitch.
The basics: it is an article about Barack Obama opting out of public funding for the 2008 general election. Fair enough. But the randomly pulled photo right next to it was one with Osama bin Laden and Ayman al-Zawahiri in it.
I think Yahoo is running some variant of the similar_text function, and that it just bit them in the ass. Depending on how you treat it, and what version you're using, you can pull false positives.
One of the things I do with my similar_text algorithms is to score something for not-quite matches. I score them a significant amount lower than 100% matches, but they're still scored.
It is easy to see how a less-than-100% match could produce a false positive between Osama and Obama: the two names share four of their five letters, in the same order. If you run a quick and dirty test on PHP's similar_text function, you will pull an 80% hit.
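To make the 80% figure concrete, here is a rough Python sketch of the approach PHP's similar_text is generally documented to take: find the longest common substring, recurse on the pieces to its left and right, and report the matched characters as a share of the combined length. The function names are made up for illustration, and this is a port for demonstration, not PHP's actual source.

```python
def similar_chars(a: str, b: str) -> int:
    """Count matching characters: take the longest common substring,
    then recurse on what is left on either side of it."""
    if not a or not b:
        return 0
    best_len = best_a = best_b = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:  # keep the first longest run found
                best_len, best_a, best_b = k, i, j
    if best_len == 0:
        return 0
    left = similar_chars(a[:best_a], b[:best_b])
    right = similar_chars(a[best_a + best_len:], b[best_b + best_len:])
    return best_len + left + right


def similar_percent(a: str, b: str) -> float:
    """Matched characters as a percentage of the combined length."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return similar_chars(a, b) * 2 * 100 / total


print(similar_percent("Osama", "Obama"))  # 80.0
```

The match here is "ama" (3 characters) plus the leading "O" (1 character), so 4 * 2 * 100 / 10 comes out to 80. The comparison is case-sensitive, so a lowercase "osama" against "Obama" would lose the "O" and score lower.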
I'd imagine that because Yahoo is trying so hard to always match a photo to a story, they're playing pretty fast and loose with the algorithm. One thing I've learned playing with articles that have huge swathes of non-matching text is that you have to have a fairly low threshold to score a hit. For example, you can't weight non-matches too heavily, because odds are high that 60-95% of the text, depending on length and subject matter, will not match.
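Here is a minimal sketch of how partial credit plus a low threshold can turn a near match into a hit. Everything in it is an assumption for illustration: the weights, cutoffs, and function names are hypothetical, Python's difflib.SequenceMatcher stands in for similar_text, and none of it is meant to describe what Yahoo actually runs.

```python
from difflib import SequenceMatcher

# Hypothetical tuning values, chosen only for this example.
EXACT_WEIGHT = 1.0    # a 100% word match
NEAR_WEIGHT = 0.25    # a not-quite match still scores, but much lower
NEAR_CUTOFF = 0.75    # similarity needed to count as a not-quite match
HIT_THRESHOLD = 0.05  # kept low because most words will not match


def word_score(tag: str, words: list[str]) -> float:
    """Score one photo tag against the best-matching headline word."""
    best = max((SequenceMatcher(None, tag, w).ratio() for w in words),
               default=0.0)
    if best == 1.0:
        return EXACT_WEIGHT
    if best >= NEAR_CUTOFF:
        return NEAR_WEIGHT
    return 0.0


def photo_score(tags: list[str], headline: str) -> float:
    """Average tag score; non-matching tags pull the average down."""
    words = headline.lower().split()
    if not tags:
        return 0.0
    return sum(word_score(t.lower(), words) for t in tags) / len(tags)


tags = ["Osama", "bin", "Laden"]
headline = "Obama opts out of public funding"
score = photo_score(tags, headline)
is_hit = score >= HIT_THRESHOLD
print(round(score, 3), is_hit)
```

With these made-up numbers, the only tag that scores is "Osama" against "Obama", and that single not-quite match is enough to clear the low threshold. Raising the threshold or the near-match cutoff would drop the false positive, at the cost of leaving more stories without a photo.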
It is not a great defense. It may not even be a defense. It is an advisory about how badly the similar_text function can bite you.