CAPTCHA & Search

CS418 - Web Programming - Spring 2015

Old Dominion University

Mat Kelly (mkelly@cs.odu.edu)
http://www.cs.odu.edu/~mkelly/cs418

derived from slides by Dr. Michele C. Weigle

Outline

  • CAPTCHA
  • Search
captcha

CAPTCHA

  • Completely Automated Public Turing test to tell Computers and Humans Apart
  • Challenge-response test provided by server
  • User solves problem and is considered (by server) human (and not a machine)
  • Goal
    • ensure that interaction is with user
    • prevent spam of all kinds, e.g., mass account creation, posts, etc.
captcha

CAPTCHA - Problem solved?

  • Vulnerable to relay attacks
    • relay captcha to human when encountered
  • Capture and re-use successful session ID
  • Dictionary attacks
  • "Iron out" images and use OCR, dictionaries
recaptcha

reCAPTCHA

  • Originates from CMU
    • bought by Google in 2009
  • Help needed to digitize books (using OCR)
    • words come from scanned books
  • "Wisdom of the Crowds"
    • reCAPTCHA contains
      • 1 term not recognized by OCR
      • 1 term well known
    • Assumption: if user gets known term right, he also gets unknown term right
    • To be confirmed by 2,3, ... others
  • Digitization project benefits
http://www.google.com/recaptcha video: https://developers.google.com/recaptcha/

No CAPTCHA reCAPTCHA

nocaptcha
catcaptcha captcha 250

Outline

  • CAPTCHA
  • Search

Relational Data Model is a Special Case...

SELECT ti.name, g.tds, g.passing_yds
FROM team_info ti, games g
WHERE ti.name = "Old Dominion"
AND g.opponent = "James Madison"
AND g.year = "2011";

Unstrcutured Data is More Common...

Unstructured data on a football stats page

Precision and Recall

Precision and Recall

Precision and Recall

Precision and Recall

Precision and Recall example

  • 10 documents in index are relevant
  • Search returns 20 documents, 5 are relevant

$P = \frac{5}{(5+15)} = 0.25$

1 out of 4 retrieved documents are relevant

$R = \frac{5}{(5+5)} = 0.5$

half of the relevant documents were retrieved

Precision and Recall

http://commons.wikimedia.org/wiki/File:Precisionrecall.svg

Precision and Recall

Search Example

  • Create and populate table for ODU football articles from odusports.com
  • Fields
    • id
    • title
    • body
    • date
    • url

LIKE and REGEXP

  • We can search rows with the "LIKE" (or "REGEXP") operator
    • http://dev.mysql.com/doc/refman/5.5/en/pattern-matching.html
    • for tables of any size, this will be s-l-o-w
  • LIKE
    • similar regular expression matching
  • REGEXP
    • extended regular expression matching

LIKE and REGEXP

  • A REGEXP pattern match succeeds if the pattern matches anywhere in the value being tested.
  • This differs from a LIKE pattern match, which succeeds only if the pattern matches the entire value

LIKE and REGEXP examples

pallenm$ftw
dknuthtek!tex
adawtf15b4b
cmooremoreM00R3!
jresigIn0JS
atanenminix!minix
linusilUvP3nGu1n5
aturing1nfin1t3TAp3
lwalloysters&camels
thewoz4daK1d5
SELECT * FROM USERS WHERE username LIKE 'a%'
adawtf15b4b
atanenminix!minix
aturing1nfin1t3TAp3
SELECT * FROM USERS WHERE password REGEXP '[0-9]{2,}'
adawtf15b4b
cmooremoreM00R3!

Full-Text Search - The Better Way

  • MATCH()...AGAINST()
    • performs a natural language search over index
  • Index = set of one or more columns of the same table
    • column must have type FULLTEXT
  • MATCH()
    • takes a comma-separated list that names the columns to be searched
  • AGAINST()
    • takes a string to search for
  • If used in WHERE clause, results returned in order of relevance score
    • relevance: similarity between search string and index row
See http://dev.mysql.com/doc/refman/5.1/en/fulltext-natural-language.html

FULLTEXT

CREATE TABLE odu_football (
 id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
 title VARCHAR(200),
 body TEXT,
 date DATE,
 url VARCHAR (200),
 FULLTEXT (title, body))
  • Can only create FULLTEXT on CHAR, VARCHAR or TEXT columns
  • "title" and "body" still available as regular columns
  • If you want to search only on "title", you need to create a separate index

FULLTEXT

  • ALTER TABLE to create index
    • also need to alter table to use ENGINE=MYISAM
  • Searches
    • Heinicke (former ODU QB)
    • playoffs
    • Monarchs

Stopwords

  • Why no results for "Monarchs"?
  • If a word appears in > 50% of the rows then the word is considered a "stop word" and is not matched
    • unless you are in Boolean mode
    • SELECT * FROM articles WHERE MATCH (title,body) -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
  • Makes sense for large collections (the word is not a good discriminator of records)
    • but can lead to unexpected results for small collections