Search
- Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
Classic IR Goal
- Maximize R(Q, D)
- Context is ignored
- User is ignored
- Corpus is static
Practical search engine concerns:
- Performance: response time, query throughput, indexing speed
- Speed in discovery and integration of new documents
- Coverage
- Freshness
- Spam
Parser = Tokenizer + Structure
- Stopping: removing words like “the”, “of”, etc.
- Stemming: grouping similar words together (e.g. fishes -> fish)
- Link extraction and analysis: how popular is a certain link
- Information extraction (text structure)
- Classifier (topic, non-content, spam, etc.)
Text Transformation/Processing
- Tokenization (lexical analysis): breaking the text into tokens
- Linguistic pre-processing: applying rules to the tokens to improve the efficiency of the index

Tokenization (lexical analysis)
Early tokenization methodology:
- A token is a sequence of 3 or more alphanumeric characters
- A space or special character indicates the end of the token
- All characters are converted to lower case
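A minimal sketch of this early rule in Python (the tokenize helper and the sample sentence are illustrative, assuming “special character” means any non-alphanumeric character):

```python
import re

def tokenize(text):
    # Lowercase, split on any non-alphanumeric character,
    # keep only tokens of 3 or more characters
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= 3]

print(tokenize("The fishes swam, in 2 BIG schools!"))
# ['the', 'fishes', 'swam', 'big', 'schools']
```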
- Precision: how relevant are the first few hits?
- Recall: how many relevant hits are presented?
- Precision is usually the priority in web search; recall becomes more important when completeness is required (e.g. legal or patent search).
Cost Per Mille (CPM)
- Cost for showing the ad per 1,000 page impressions
- Important for branding campaigns
Cost Per Click (CPC)
- Cost for users clicking on the ad
- Important for sales campaigns
- Click Fraud
Legitimate approach:
- Indexed age of the pages (older is better)
- Good incoming links
- Good content: well written, well organized, up to date
- Good use of web standards/practices
- Fast servers (quick response)
- Cloaking: serving different content to a spider than to a browser
- Link spamming: bots that search for blogs and leave comments with links
- Clicker bots: bots that issue queries and “click” on targeted query results
- Google: PageRank (1996-?), Panda (2011), Penguin (2012), Hummingbird (2013), Pigeon (2014); ranking based on words and links.
- RankBrain: machine-learning based.
Ways to acquire data:
- Data dumps
- URL downloads
- Web APIs
- Web crawling
- Be polite: check the site's robots.txt and ask the website for permission before crawling it
A naive crawler is seriously lacking for practical use:
- Will upset web admins (impolite): it abuses the web servers
- Very slow: fetches 1 page at a time
- Will get caught in traps and infinite sequences
- Will fetch duplicates without noticing
- Will bring in data noise
- Will miss content due to client-side scripting
If 1 page fetch = 500 ms, how much time does it take to crawl 1 million pages?
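Working it out: 10^6 pages × 0.5 s/page = 500,000 s ≈ 139 hours ≈ 5.8 days for a single sequential fetcher, which is why practical crawlers fetch many pages in parallel.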
Fetching a page:
- The web crawler client program connects to a domain name system (DNS) server
- The DNS server translates the hostname into an internet protocol (IP) address
- The crawler then attempts to connect to the server host using a specific port
- After connection, the crawler sends an HTTP request to the web server to request a page, usually a GET request
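A sketch of this sequence using Python's standard library (example.com is a placeholder; urllib handles the TCP connection and request formatting):

```python
import socket
import urllib.request

host = "example.com"                      # placeholder domain
ip = socket.gethostbyname(host)           # DNS: hostname -> IP address
print(f"{host} resolves to {ip}")

# Connect to the server (port 80 for plain HTTP) and send a GET request
req = urllib.request.Request(f"http://{host}/", method="GET")
with urllib.request.urlopen(req, timeout=10) as resp:
    page = resp.read()                    # raw bytes of the page
    print(resp.status, len(page), "bytes")
```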
- UTF-8 uses one byte for English (= ASCII), but as many as 4 bytes for some traditional Chinese characters; variable-length encoding makes string operations more difficult
- UTF-32 uses 4 bytes for every character: more memory, and no backward compatibility with ASCII
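A quick illustration of the size difference (utf-32-be is used here to avoid counting the byte-order mark):

```python
# Bytes per character: UTF-8 varies from 1 to 4, UTF-32 is always 4
for ch in ["a", "é", "中", "𝄞"]:
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-32-be")))
# a 1 4 / é 2 4 / 中 3 4 / 𝄞 4 4
```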
Functions such as a cyclic redundancy check (CRC) have been developed that consider the positions of the bytes
- Still prone to collisions, but very rare
- Other functions: BLAKE2/BLAKE3, MD5, SHA-1, SHA-2, etc.
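A sketch of checksum-based exact-duplicate detection with Python's hashlib (BLAKE2 here; near-duplicates need fingerprinting techniques instead):

```python
import hashlib

def checksum(content: bytes) -> str:
    # The digest depends on byte positions, unlike a simple byte sum
    return hashlib.blake2b(content).hexdigest()

seen = set()
for doc in [b"hello world", b"hello world", b"world hello"]:
    h = checksum(doc)
    print(h[:12], "duplicate" if h in seen else "new")
    seen.add(h)
```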
- Important drawback: cannot react automatically to new words
- Identify syntactic phrases using a part-of-speech (POS) tagger
POS taggers use statistical / machine learning models of text to predict syntactic tags of words
- A phrase is any sequence of n words, known as an n-gram
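A minimal n-gram extractor as a sketch (the ngrams helper is illustrative):

```python
def ngrams(tokens, n):
    # All contiguous sequences of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["tropical", "fish", "tank", "supplies"], 2))
# [('tropical', 'fish'), ('fish', 'tank'), ('tank', 'supplies')]
```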
Each index term is associated with an inverted list
- Contains lists of documents, or lists of word occurrences in documents, and other information
- Each entry is called a posting
- The part of the posting that refers to a specific document or location is called a pointer
- Each document in the collection is given a unique number
- Lists are usually document-ordered (sorted by document number)
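A toy version of such an index (postings here are (document number, word position) pairs; real indexes compress these lists):

```python
from collections import defaultdict

docs = {1: "fish tank", 2: "tropical fish fish"}

index = defaultdict(list)                        # term -> posting list
for doc_id in sorted(docs):                      # document-ordered lists
    for pos, term in enumerate(docs[doc_id].split()):
        index[term].append((doc_id, pos))        # posting with doc pointer

print(dict(index))
# {'fish': [(1, 0), (2, 1), (2, 2)], 'tank': [(1, 1)], 'tropical': [(2, 0)]}
```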
- Jaccard coefficient: jaccard(A, B) = |A ∩ B| / |A ∪ B|; A and B don't have to be the same size.
- Always assigns a number between 0 and 1.
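As a sketch, with Python sets:

```python
def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|, always between 0 and 1
    return len(a & b) / len(a | b) if (a or b) else 0.0

print(jaccard({"tropical", "fish"}, {"fish", "tank", "supplies"}))  # 0.25
```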
- tf-idf
- The term frequency tf(t, d) of term t in document d is defined as the number of times that t occurs in d.
- Log-frequency weight: w(t, d) = 1 + log10(tf(t, d)) if tf(t, d) > 0, and 0 otherwise; a document's score is the sum of w(t, d) over the query terms that appear in d.
- df(t) is the document frequency of t: the number of documents that contain t. idf(t) = log10(N/df(t)), where N is the total number of documents.
- tf-idf weight: w(t, d) = (1 + log10(tf(t, d))) × log10(N/df(t))
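A sketch of the weight computed from these definitions (documents are token lists here):

```python
import math

def tfidf(term, doc_tokens, docs):
    # (1 + log10 tf) * log10(N / df); zero if the term is absent
    tf = doc_tokens.count(term)
    if tf == 0:
        return 0.0
    df = sum(1 for d in docs if term in d)       # document frequency
    return (1 + math.log10(tf)) * math.log10(len(docs) / df)

docs = [["fish", "fish", "tank"], ["tropical", "fish"], ["tank", "supplies"]]
print(round(tfidf("fish", docs[0], docs), 3))    # ≈ 0.229
```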
- HITS
- Hub score h(p) = sum of a(q) over pages q that p points to; authority score a(p) = sum of h(q) over pages q that point to p
- Scaling: normalize the scores after each iteration so they do not grow without bound
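A compact power-iteration sketch of HITS (the link graph is a toy example):

```python
def hits(links, iters=50):
    # links: page -> list of pages it points to
    nodes = set(links) | {q for qs in links.values() for q in qs}
    h = {n: 1.0 for n in nodes}                  # hub scores
    a = {n: 1.0 for n in nodes}                  # authority scores
    for _ in range(iters):
        a = {n: sum(h[p] for p in nodes if n in links.get(p, [])) for n in nodes}
        h = {n: sum(a[q] for q in links.get(n, [])) for n in nodes}
        for v in (a, h):                         # scaling step: normalize
            norm = sum(x * x for x in v.values()) ** 0.5 or 1.0
            for n in v:
                v[n] /= norm
    return h, a

h, a = hits({"x": ["y", "z"], "y": ["z"], "z": []})
print(a)  # z gets the highest authority score
```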
- PageRank
- At a dead end, jump to a random web page.
- At any non-dead end, with probability 10%, jump to a random web page.
- With remaining probability (90%), go out on a random link.
- The 10% teleport probability is a parameter (0.1-0.15 is typical).
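A random-surfer PageRank sketch following these rules (10% teleport, dead ends always jump):

```python
def pagerank(links, teleport=0.1, iters=100):
    # links: page -> list of outgoing links
    pages = set(links) | {q for qs in links.values() for q in qs}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            out = links.get(p, [])
            if not out:                          # dead end: always jump
                for q in pages:
                    nxt[q] += pr[p] / n
                continue
            for q in pages:                      # teleport with probability 10%
                nxt[q] += teleport * pr[p] / n
            for q in out:                        # otherwise follow a random link
                nxt[q] += (1 - teleport) * pr[p] / len(out)
        pr = nxt
    return pr

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```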
- Precision: how well the search engine is doing at rejecting non-relevant documents.
- Recall: how well the search engine is doing at finding all the documents for a query.
- F-measure: the harmonic mean of precision and recall, F = 2PR / (P + R)
- MAP (mean average precision): precision averaged over the ranks of the relevant documents, then averaged over queries
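Set-based versions of these measures as a sketch (MAP additionally averages precision over the ranks of relevant documents and over queries):

```python
def precision_recall_f(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)               # relevant documents retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0    # harmonic mean
    return p, r, f

print(precision_recall_f(["d1", "d2", "d3", "d4"], ["d2", "d4", "d9"]))
# (0.5, 0.6666..., 0.5714...)
```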
Text classification
- Bag-of-words model: represent a document by its term counts, ignoring word order
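A minimal bag-of-words sketch (word order is discarded, only counts remain):

```python
from collections import Counter

def bag_of_words(text):
    # term -> count; positions are not kept
    return Counter(text.lower().split())

print(bag_of_words("the fish the tank"))
# Counter({'the': 2, 'fish': 1, 'tank': 1})
```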