Google’s UN-clear data structures and Google algorithms
Abstract. The architecture of Google’s search engine is designed to handle large amounts of web pages and queries. The search engine uses data structures optimized for large data sets and adding more servers to the system easily increases its performance of answering queries. Google’s PageRank algorithm uses the link structure of the WWW to calculate the qualities of web pages. Using the link structure in the ordering of the search results makes it harder for anyone to manipulate Google’s search results but not impossible. Google also has some means to prevent manipulation and maintain the quality of the search results. That what where Google algorithm acts.
1 Introduction
Nowadays the WWW is very large consisting of billions of web pages and other items such as images, documents and music. At the time of writing Google is a very popular search engine on the Internet. Nielsen//Netratings has measured that almost half of searches done in the US are done with Google, Yahoo being next with 23,8% [Nie06]. From the beginning Google was designed to handle large datasets and many search queries per second. Data structures, data storage and algorithms are designed so that the system is scalable both in the number of pages index per second
and the number of queries handled per second [BrP98].
The task of search engines to provide quality search results is difficult because of the large number of documents on the WWW. The quality of Google is partly based on the PageRank algorithm. It is used to calculate a measure of quality of web pages. The position of a page on the search results page is partially dependent on the PageRank value so that quality pages have a higher position. PageRank is also meant to prevent people manipulating the search results. In this it succeeds only partially as there are still ways to manipulate Google both by words on a page and through features
of PageRank. [PBM98]
Section 2 introduces some of the data structures used by Google. Google’s PageRank algorithm and is presented in the section 3. Section 4 discusses Google’s Algorithm and some weaknesses of Google’s algorithms.
2 Data structures
The Google File System
In 2003 Google replaced its original BigFiles file system with the Google File System (GFS). Like BigFiles it is used to store files in a distributed matter. GFS consists of a single master and many chunk servers. Each file is broken into large chunks (e.g. 64 MB) identified with chunk handles. And each chunk is distributed into multiple chunk servers (e.g. three servers) for reliability. When a client wants to read or write a segment of a file it calculates the chunk index using the offset of the segment and the size of a chunk (chunk index=segment offset/chunk size). The master keeps a
mapping of a filename and a chunk index into a chunk handle. The master also keeps in memory which chunks each chunk server has. The client can ask the master for the chunk handle and the chunk servers that have the chunk by providing the filename and the chunk index. [Goo7]
The idea of GFS is to gain high performance and fault tolerant system using inexpensive hardware that often fails. GFS is optimized for storing large files, appending large amounts of data to the end of the file and making large streaming reads and small random reads. The system also features atomic concurrent appending of records into a single file with minimal synchronization overhead. This feature is used, for example, for a queue with many producers and one consumer. [Goo7]
Repository
Google stores the full HTML of all the web pages in a repository. Every web page is stored as a packet that includes a docID identifying the document, length of the URL, length of the page, URL and HTML of the page (see figure 1). These packets are stored in a compressed form to save space. The repository is simply composed of a stack of triplets of sync, length of the compressed packet and the packet itself. There are no other data structures in the repository. To find a web page in the repository requires either a pointer to the file or going through the whole file. [Goo7]
Find more details about Google algorithm check the information below and useful link to Google algorithm.
Download Google Algorithm Detail Paper source by P Huuhka http://citeseerx.ist.psu.edu/