Search

  • Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information

  • Classic IR Goal

    • Maximize R(Q, D) • Context is ignored • User is ignored • Corpus is static

– Performance • Response time • Query throughput • Indexing speed – Speed in discovery and integration of new documents – Coverage – Freshness – Spam

Parser = Tokenizer + Structure • Stopping : removing words “the”, “of”, etc. • Stemming : grouping similar words together (e.g. fishes -> fish) • Link extraction and analysis: how popular is a certain link • Information Extraction (text structure) • Classifier (topic, non-content, spam, etc.)

  • Tokenization
  • (Text Transformation/Processing)
    • Tokenization (lexical analysis)
      • Breaking the text into tokens
    • Linguistic pre-processing
      • Applying rules to the tokens to improve the efficiency of the index

Early tokenization methodology: – A sequence of 3 or more alphanumeric characters – A space or special character indicates the end of the token – All characters were converted to lower case

• Precision: How relevant are the first few hits? • Recall: How many relevant hits are presented?

  • Precision is more important than recall (When completeness is required)

Cost Per Mil (CPM) – Cost for showing the ad with 103-page shows – Important for branding campaigns Cost Per Click (CPC) – Cost for users clicking on the ad – Important for sales campaigns

  • Click Fraud

• Legitimate approach: – Indexed age of the pages (older is better) – Good incoming links – Good content, well written, well organized, up to date – Good use of web standards/practices – Fast servers (quick response)

  • Cloaking - Serving different content to a spider than to a browser

• Link spamming – Bots that search for blogs and leave comments with links • Clicker bots – Bots that issue queries and “click” on targeted query results

– Google: PageRank (1996 - ?), Panda (2011), Penguin (2012), Hummingbird (2013), Pigeon (2014) • Words and links.

  • RankBrain: machine learning based.

– Data dumps – URL downloads – Web APIs – Web Crawling

  • Be polite: try to ask the website if you can crawl it first

Seriously lacking to use in practice

  1. Will upset web admins (impolite) • It’s abusing the web servers
  2. Very slow • 1 page at a time
  3. Will get caught in traps and infinite sequences
  4. Will fetch duplicates without noticing
  5. Will bring in data noise
  6. Will miss content due to client-side scripting

1 page fetch = 500ms – How much time to crawl 1 million pages?

Web crawler client program connects to a domain name system (DNS) server • DNS server translates the hostname into an internet protocol (IP) address • Crawler then attempts to connect to server host using specific port • After connection, crawler sends an HTTP request to the web server to request a page – usually a GET request

UTF-8 uses one byte for English (=ASCII), but as many as 4 bytes for some traditional Chinese characters • variable length encoding, more difficult to do string operations – UTF-32 uses 4 bytes for every character • more memory, no backward compatibility with ASCII

Functions such as a cyclic redundancy check (CRC), have been developed that consider the positions of the bytes – Still prone to collisions, but very rare – Other functions as BLAKE2/3, MD5, SHA1, SHA2, etc…

– Important drawback : cannot react automatically to new words

  • Identify syntactic phrases using a part-of-speech (POS) tagger
  • POS taggers use statistical / machine learning models of text to predict syntactic tags of words

  • phrase is any sequence of n words – known as n-grams

Each index term is associated with an inverted list – Contains lists of documents, or lists of word occurrences in documents, and other information – Each entry is called a posting – The part of the posting that refers to a specific document or location is called a pointer – Each document in the collection is given a unique number – Lists are usually document-ordered (sorted by document number)

  • Jaccard : jaccard(A,B) = A ∩ B / A ∪ B
    • A and B don’t have to be the same size.
    • Always assigns a number between 0 and 1.
  • tf-idf
    • The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d.
    • w(t, d) = 1 + log10(tf(t, d)), s = sum(w(t, d))
    • df(t) is the document frequency of t: the number of documents that contain t, idf(t) = log10(N/df(t))
    • w(t, d) = 1 + log10(tf(t, d)) * log10(N/df(t))
  • HITS
    • h = sum(a), a = sum(h)
    • scaling
  • PageRank
    • At a dead end, jump to a random web page.
    • At any non-dead end, with probability 10%, jump to a random web page.
    • With remaining probability (90%), go out on a random link.
    • 10% - a parameter.
  • Precision: how well it is doing at rejecting non-relevant documents.
  • Recall: how well the search engine is doing at finding all the documents for a query.
  • F-measure
  • MAP
  • text classification
  • Bag of words model
Written on December 18, 2020