Search
- Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
Classic IR Goal
- Maximize R(Q, D)
- Context is ignored
- User is ignored
- Corpus is static
Practical search engine concerns:
- Performance: response time, query throughput, indexing speed
- Speed in discovery and integration of new documents
- Coverage
- Freshness
- Spam
Parser = Tokenizer + Structure
- Stopping: removing words like “the”, “of”, etc.
- Stemming: grouping similar words together (e.g. fishes -> fish)
- Link extraction and analysis: how popular is a certain link
- Information extraction (text structure)
- Classifier (topic, non-content, spam, etc.)
Text Transformation/Processing
- Tokenization (lexical analysis): breaking the text into tokens
- Linguistic pre-processing: applying rules to the tokens to improve the efficiency of the index

Tokenization (lexical analysis)
Early tokenization methodology:
- A token is a sequence of 3 or more alphanumeric characters
- A space or special character indicates the end of the token
- All characters are converted to lower case
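A minimal sketch of this early rule in Python (the tokenize helper and the sample sentence are illustrative, assuming “special character” means any non-alphanumeric character):

```python
import re

def tokenize(text):
    # Lowercase, split on any non-alphanumeric character,
    # keep only tokens of 3 or more characters
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= 3]

print(tokenize("The fishes swam, in 2 BIG schools!"))
# ['the', 'fishes', 'swam', 'big', 'schools']
```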
- Precision: how relevant are the first few hits?
- Recall: how many relevant hits are presented?
- Precision is usually the priority in web search; recall becomes more important when completeness is required (e.g. legal or patent search).
Cost Per Mille (CPM)
- Cost for showing the ad per 1,000 page impressions
- Important for branding campaigns
Cost Per Click (CPC)
- Cost for users clicking on the ad
- Important for sales campaigns
- Click Fraud
Legitimate approach:
- Indexed age of the pages (older is better)
- Good incoming links
- Good content: well written, well organized, up to date
- Good use of web standards/practices
- Fast servers (quick response)
- Cloaking: serving different content to a spider than to a browser
- Link spamming: bots that search for blogs and leave comments with links
- Clicker bots: bots that issue queries and “click” on targeted query results
- Google: PageRank (1996-?), Panda (2011), Penguin (2012), Hummingbird (2013), Pigeon (2014); ranking based on words and links.
- RankBrain: machine-learning based.
Ways to acquire data:
- Data dumps
- URL downloads
- Web APIs
- Web crawling
- Be polite: check the site's robots.txt and ask the website for permission before crawling it
A naive crawler is seriously lacking for practical use:
- Will upset web admins (impolite): it abuses the web servers
- Very slow: fetches 1 page at a time
- Will get caught in traps and infinite sequences
- Will fetch duplicates without noticing
- Will bring in data noise
- Will miss content due to client-side scripting
If 1 page fetch = 500 ms, how much time does it take to crawl 1 million pages?
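Working it out: 10^6 pages × 0.5 s/page = 500,000 s ≈ 139 hours ≈ 5.8 days for a single sequential fetcher, which is why practical crawlers fetch many pages in parallel.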
Fetching a page:
- The web crawler client program connects to a domain name system (DNS) server
- The DNS server translates the hostname into an internet protocol (IP) address
- The crawler then attempts to connect to the server host using a specific port
- After connection, the crawler sends an HTTP request to the web server to request a page, usually a GET request
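A sketch of this sequence using Python's standard library (example.com is a placeholder; urllib handles the TCP connection and request formatting):

```python
import socket
import urllib.request

host = "example.com"                      # placeholder domain
ip = socket.gethostbyname(host)           # DNS: hostname -> IP address
print(f"{host} resolves to {ip}")

# Connect to the server (port 80 for plain HTTP) and send a GET request
req = urllib.request.Request(f"http://{host}/", method="GET")
with urllib.request.urlopen(req, timeout=10) as resp:
    page = resp.read()                    # raw bytes of the page
    print(resp.status, len(page), "bytes")
```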
- UTF-8 uses one byte for English (= ASCII), but as many as 4 bytes for some traditional Chinese characters; variable-length encoding makes string operations more difficult
- UTF-32 uses 4 bytes for every character: more memory, and no backward compatibility with ASCII
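A quick illustration of the size difference (utf-32-be is used here to avoid counting the byte-order mark):

```python
# Bytes per character: UTF-8 varies from 1 to 4, UTF-32 is always 4
for ch in ["a", "é", "中", "𝄞"]:
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-32-be")))
# a 1 4 / é 2 4 / 中 3 4 / 𝄞 4 4
```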
Functions such as a cyclic redundancy check (CRC) have been developed that consider the positions of the bytes
- Still prone to collisions, but very rare
- Other functions: BLAKE2/BLAKE3, MD5, SHA-1, SHA-2, etc.
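A sketch of checksum-based exact-duplicate detection with Python's hashlib (BLAKE2 here; near-duplicates need fingerprinting techniques instead):

```python
import hashlib

def checksum(content: bytes) -> str:
    # The digest depends on byte positions, unlike a simple byte sum
    return hashlib.blake2b(content).hexdigest()

seen = set()
for doc in [b"hello world", b"hello world", b"world hello"]:
    h = checksum(doc)
    print(h[:12], "duplicate" if h in seen else "new")
    seen.add(h)
```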
- Important drawback: cannot react automatically to new words
- Identify syntactic phrases using a part-of-speech (POS) tagger
POS taggers use statistical / machine learning models of text to predict syntactic tags of words
- A phrase is any sequence of n words, known as an n-gram
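A minimal n-gram extractor as a sketch (the ngrams helper is illustrative):

```python
def ngrams(tokens, n):
    # All contiguous sequences of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["tropical", "fish", "tank", "supplies"], 2))
# [('tropical', 'fish'), ('fish', 'tank'), ('tank', 'supplies')]
```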
Each index term is associated with an inverted list
- Contains lists of documents, or lists of word occurrences in documents, and other information
- Each entry is called a posting
- The part of the posting that refers to a specific document or location is called a pointer
- Each document in the collection is given a unique number
- Lists are usually document-ordered (sorted by document number)
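A toy version of such an index (postings here are (document number, word position) pairs; real indexes compress these lists):

```python
from collections import defaultdict

docs = {1: "fish tank", 2: "tropical fish fish"}

index = defaultdict(list)                        # term -> posting list
for doc_id in sorted(docs):                      # document-ordered lists
    for pos, term in enumerate(docs[doc_id].split()):
        index[term].append((doc_id, pos))        # posting with doc pointer

print(dict(index))
# {'fish': [(1, 0), (2, 1), (2, 2)], 'tank': [(1, 1)], 'tropical': [(2, 0)]}
```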
- Jaccard coefficient: jaccard(A, B) = |A ∩ B| / |A ∪ B|; A and B don't have to be the same size.
- Always assigns a number between 0 and 1.
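As a sketch, with Python sets:

```python
def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|, always between 0 and 1
    return len(a & b) / len(a | b) if (a or b) else 0.0

print(jaccard({"tropical", "fish"}, {"fish", "tank", "supplies"}))  # 0.25
```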
- tf-idf
- The term frequency tf(t, d) of term t in document d is defined as the number of times that t occurs in d.
- Log-frequency weight: w(t, d) = 1 + log10(tf(t, d)) if tf(t, d) > 0, and 0 otherwise; a document's score is the sum of w(t, d) over the query terms that appear in d.
- df(t) is the document frequency of t: the number of documents that contain t. idf(t) = log10(N/df(t)), where N is the total number of documents.
- tf-idf weight: w(t, d) = (1 + log10(tf(t, d))) × log10(N/df(t))
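A sketch of the weight computed from these definitions (documents are token lists here):

```python
import math

def tfidf(term, doc_tokens, docs):
    # (1 + log10 tf) * log10(N / df); zero if the term is absent
    tf = doc_tokens.count(term)
    if tf == 0:
        return 0.0
    df = sum(1 for d in docs if term in d)       # document frequency
    return (1 + math.log10(tf)) * math.log10(len(docs) / df)

docs = [["fish", "fish", "tank"], ["tropical", "fish"], ["tank", "supplies"]]
print(round(tfidf("fish", docs[0], docs), 3))    # ≈ 0.229
```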
- HITS
- Hub score h(p) = sum of a(q) over pages q that p points to; authority score a(p) = sum of h(q) over pages q that point to p
- Scaling: normalize the scores after each iteration so they do not grow without bound
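A compact power-iteration sketch of HITS (the link graph is a toy example):

```python
def hits(links, iters=50):
    # links: page -> list of pages it points to
    nodes = set(links) | {q for qs in links.values() for q in qs}
    h = {n: 1.0 for n in nodes}                  # hub scores
    a = {n: 1.0 for n in nodes}                  # authority scores
    for _ in range(iters):
        a = {n: sum(h[p] for p in nodes if n in links.get(p, [])) for n in nodes}
        h = {n: sum(a[q] for q in links.get(n, [])) for n in nodes}
        for v in (a, h):                         # scaling step: normalize
            norm = sum(x * x for x in v.values()) ** 0.5 or 1.0
            for n in v:
                v[n] /= norm
    return h, a

h, a = hits({"x": ["y", "z"], "y": ["z"], "z": []})
print(a)  # z gets the highest authority score
```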
- PageRank
- At a dead end, jump to a random web page.
- At any non-dead end, with probability 10%, jump to a random web page.
- With remaining probability (90%), go out on a random link.
- The 10% teleport probability is a parameter (0.1-0.15 is typical).
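A random-surfer PageRank sketch following these rules (10% teleport, dead ends always jump):

```python
def pagerank(links, teleport=0.1, iters=100):
    # links: page -> list of outgoing links
    pages = set(links) | {q for qs in links.values() for q in qs}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            out = links.get(p, [])
            if not out:                          # dead end: always jump
                for q in pages:
                    nxt[q] += pr[p] / n
                continue
            for q in pages:                      # teleport with probability 10%
                nxt[q] += teleport * pr[p] / n
            for q in out:                        # otherwise follow a random link
                nxt[q] += (1 - teleport) * pr[p] / len(out)
        pr = nxt
    return pr

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```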
- Precision: how well the search engine is doing at rejecting non-relevant documents.
- Recall: how well the search engine is doing at finding all the documents for a query.
- F-measure: the harmonic mean of precision and recall, F = 2PR / (P + R)
- MAP (mean average precision): precision averaged over the ranks of the relevant documents, then averaged over queries
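Set-based versions of these measures as a sketch (MAP additionally averages precision over the ranks of relevant documents and over queries):

```python
def precision_recall_f(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)               # relevant documents retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0    # harmonic mean
    return p, r, f

print(precision_recall_f(["d1", "d2", "d3", "d4"], ["d2", "d4", "d9"]))
# (0.5, 0.6666..., 0.5714...)
```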
Text classification
- Bag-of-words model: represent a document by its term counts, ignoring word order
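A minimal bag-of-words sketch (word order is discarded, only counts remain):

```python
from collections import Counter

def bag_of_words(text):
    # term -> count; positions are not kept
    return Counter(text.lower().split())

print(bag_of_words("the fish the tank"))
# Counter({'the': 2, 'fish': 1, 'tank': 1})
```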