Search engine statistics reveal the public's private interests

In early February, Janet Jackson became the most-searched name on Yahoo after her exposure incident. Searches for her accounted for 20% of the total, a record high for a Yahoo search keyword. That figure is 60 times the one recorded by the popular Paris Hilton, and 80 times that of the singer Britney.

Internet search engines have become the best tool for reflecting the public's interests and concerns, perhaps a more realistic one than any other survey statistic.

High-frequency words and social hot spots

Throughout history, popular vocabulary has reflected the focus of public attention in the short term, and in the long run it can be pieced together into a picture of how the world has developed. Jon Kleinberg, a researcher at Cornell University in the United States, conducted a study that identified the popular vocabulary of different historical periods by counting the words in US State of the Union addresses after 1790. For example, during the American Revolutionary War, the most frequent terms were "militia" and "British Army"; from 1947 to 1959, "nuclear bomb" was mentioned repeatedly.

Today, search engines claim to know the secrets of the public. Search engines do not just passively answer questions; the major search engines also publish wide-ranging statistics, and the results can be interesting. Kleinberg believes that although a computer does not understand history, it can acquire relevant background knowledge by running statistics over the text of blogs, e-mails and web pages, and so better understand what a search request means. In addition, these statistics can help sociologists and marketers spot emerging trends and use them as reference information for their research or business.

In China, search engines have gone further, using this ability to reflect public trends to push actively into the wider business world. On February 12, 2004, Baidu Search and Light Media jointly released the “2003 Global Chinese Star Popularity List”. Popular terms such as Jay Chou, "Infernal Affairs", "Dragon", spokesperson, gossip and shady dealings all made the list. Before that, on January 8, Baidu Search also teamed up with Hu Run to release the “2003 China Mainland Top 100 Popularity List”.

However, search engines are sometimes stumped. For example, if you use Google to search for Hamlet's famous line "To be or not to be", you will find that Google's answers miss the point. The results list the official GNU's Not Unix website, the Hot or Not dating site and so on, with no trace of Shakespeare. This classic example leads to a term from search technology: the stop word.

All of a computer's abilities rest on calculation, and reading is no exception. While a search engine browses web pages scattered across every corner of the Internet, it also keeps counting word frequencies in the background. Some words occur extremely often, which makes them expensive to tally, yet they carry little specific meaning; examples include the most common Chinese function words and the English words "the" and "and". Returning every result that contains such a word would be far too much. When a search engine meets high-frequency words like those in Hamlet's line, it often "suddenly stalls", which is why these words are called "stop words". When Google "read" Hamlet's line, it ran into four stop words. Left with no alternative, it searched only for the least frequent one, "not", and returned some popular websites about "not".
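The stop-word behaviour described above can be sketched roughly as follows. This is a minimal illustration, not how Google actually implements it: the stop-word list `STOP_WORDS`, the frequency table `DOC_FREQ` and all function names are assumptions made up for the example.

```python
# Hypothetical stop-word list; a real engine would derive its own list
# from word-frequency statistics over its whole index.
STOP_WORDS = {"the", "and", "to", "be", "or", "not"}

# Hypothetical frequencies (number of pages containing each word).
DOC_FREQ = {"to": 900, "be": 700, "or": 600, "not": 300}

PUNCT = ".,;:!?\"'"


def tokenize(text):
    """Lowercase the text, split on whitespace, strip edge punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def content_terms(query):
    """Return the query words that are not stop words."""
    return [w for w in tokenize(query) if w not in STOP_WORDS]


def rarest_term(query, doc_freq):
    """Fallback when only stop words remain: pick the least frequent word."""
    return min(tokenize(query), key=lambda w: doc_freq.get(w, 0))


query = "To be or not to be"
print(content_terms(query))          # [] because every word is a stop word
print(rarest_term(query, DOC_FREQ))  # not
```

With every query word filtered out, the sketch falls back to the rarest word, which mirrors the article's account of Google ending up searching for "not".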

If you enclose the line in quotation marks, Google suddenly sees the light and finds the relevant websites. This feature is called phrase search. Alltheweb, however, is smarter than Google here: it has added this famous line to its search directory and provides relevant links directly on the results page.
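Phrase search can be illustrated with a small sketch: instead of matching words independently, the page must contain the query words consecutively and in order. The function names and sample pages below are hypothetical, and real engines use positional indexes rather than scanning page text like this.

```python
PUNCT = ".,;:!?\"'"


def tokenize(text):
    """Lowercase the text, split on whitespace, strip edge punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def contains_phrase(page_text, phrase):
    """True if the words of `phrase` appear consecutively in `page_text`."""
    words = tokenize(page_text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return False
    # Slide a window of n words over the page and compare it to the phrase.
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


hamlet_page = "Hamlet asks: to be, or not to be, that is the question"
gnu_page = "GNU's Not Unix is not to be confused with Unix"

print(contains_phrase(hamlet_page, "To be or not to be"))  # True
print(contains_phrase(gnu_page, "To be or not to be"))     # False
```

Because the whole phrase is matched as a unit, stop words no longer need to be dropped, so the page about GNU is excluded even though it contains "not".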

How search works

“Searched the web for gerald salton. 5,430 results found; showing results 1 to 10. Search took 0.06 seconds.” Those 0.06 seconds reflect the speed and efficiency of search engines as represented by Google. How is all this achieved?

Normally only 10 servers fit in a computer room, but Google fits 80 in the same space because its machines are bare boards with the cases and unneeded parts removed. Larry Page and Sergey Brin stripped off the outer casings and removed unused chips and parts, making the machines smaller and easier to maintain, which of course also cut the cost of renting the computer room. Google uses more than 10,000 servers, distributed across computer rooms in five different regions, to cope with the vast amount of information on the network.

To respond quickly to every search request, a search engine does a great deal of work in advance, repeating three steps in the background. In the first step, the search engine keeps sending out "crawlers" to collect every accessible web page on the Internet, public or hidden; as long as a page can be reached, a crawler will latch onto it. In this way, the crawlers that go out regularly build up a massive database for the search engine. Because crawlers go out on a fixed schedule, they sometimes cannot keep up with the pace of page updates, which is why Google's "page snapshot" can differ from the live page. In the second step, another program counts how often each word appears in the cached pages. In the third step, the engine uses the word frequencies to summarize the main idea and key passages of each page, and then builds an index organized by keyword. Every user search is computed against these indexes, so the response is extremely fast.
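The second and third steps can be sketched as a tiny inverted index: count the words in each cached page, then map every word to the pages that contain it, so that a query only has to look up the index. The page contents, URLs and function names here are invented for illustration, and ranking by summed word frequency is an assumption rather than any engine's actual scoring.

```python
from collections import Counter, defaultdict


def count_words(page_text):
    """Step two: count how often each word appears in one cached page."""
    return Counter(page_text.lower().split())


def build_index(cached_pages):
    """Step three: map each word to {url: frequency} across all pages."""
    index = defaultdict(dict)
    for url, text in cached_pages.items():
        for word, freq in count_words(text).items():
            index[word][url] = freq
    return index


def search(index, query):
    """Return pages containing every query word, highest frequency first."""
    words = query.lower().split()
    if not words:
        return []
    postings = [index.get(w, {}) for w in words]
    common = set(postings[0]).intersection(*postings[1:])
    return sorted(common, key=lambda u: (-sum(p[u] for p in postings), u))


# Hypothetical cached pages, standing in for what the crawler collected.
pages = {
    "a.example": "salton vector space model salton",
    "b.example": "gerald salton information retrieval",
    "c.example": "cooking recipes",
}
index = build_index(pages)
print(search(index, "salton"))         # ['a.example', 'b.example']
print(search(index, "gerald salton"))  # ['b.example']
```

The expensive work happens once in `build_index`; each query is then a handful of dictionary lookups, which is why the response can be so fast.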

Whether it is Google's patented PageRank technology or Baidu's own "super-chain analysis" technology, the general idea is similar: count how many links on other pages point to each page. The higher the count, the higher the page's level and the nearer the front it ranks. Some search engine experts point out that SearchRank is more accurate than PageRank. UsedRank refers to a second round of statistics based on which search results users click. Some pages may be ranked on the eighth page of results by the initial calculation, but by examining the properties of each link, the engine can take into account which pages users clicked and browsed successfully. Search engines such as Alltheweb, Yahoo, and Baidu faithfully count every click, while Google keeps things simple and does no such re-ranking.
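The general idea described above, counting how many other pages link to each page, can be sketched as below. This is plain inbound-link counting under invented data, not the actual PageRank or super-chain analysis algorithms, which are considerably more involved; the link graph and function names are hypothetical.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count, for each page, how many distinct other pages link to it.

    `links` maps each page to the list of pages it links to.
    """
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # repeated links from one page count once
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts


def rank_pages(links):
    """Order pages by inbound-link count, highest first (ties by name)."""
    counts = inbound_link_counts(links)
    return sorted(counts, key=lambda page: (-counts[page], page))


# Hypothetical link graph.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c"],
}
print(rank_pages(links))  # ['c', 'a', 'b', 'd']
```

Page "c" is linked from three different pages and therefore ranks first, while "d", which nobody links to, ranks last.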

Many service websites share the view that users are lazy. Click statistics show that many users read only the first page of search results and do not browse the following pages. Some websites therefore display more results on the first page; Yahoo, for example, shows 20 items on its “first page”. Sina has taken this lay-it-all-out approach to the extreme: a search for "flowers" on Valentine's Day brings up 78 websites at once. But search engines such as Google, Alltheweb, and Baidu still stick to a simple style and show only 10 results per page.

Beyond their different search algorithms, search engines are also refining their services and launching increasingly rich search functions, such as the much-loved Google Image Search. Alltheweb's image search is in fact also very good, and it supports audio, video and download-site search as well.

Integrated search engines

So, do users have to access each search engine one by one to get the best search results? Maybe not. Search integration technology can provide as much information as possible at once.

Search integration, known in English as Meta Search, might sound more fashionable if translated literally as "post-search", but that name would not reflect its signature function: reorganizing search results. An ordinary search extracts information from various network resources by following some clue, while Meta Search reprocesses the results of other search engines; it is a search of searches.

When a user enters a keyword into a search integration engine, it sends the search request simultaneously to a number of independently operating search engines and retrieves the required information from their web databases. A search integration engine does not build its own web database; all of its data comes from other search engines, so the integrated results cannot be better than the results of the engines it draws on. However, it frees users from repetitive work while providing better-organized search results, which was the ideal of Meta Search in its early days.

Search integration engines currently work in roughly two ways. The more popular approach is to analyze the combined search results, remove duplicate entries, and then cluster the results by topic. The best of these sites are Vivisimo, MetaCrawler and DogPile. The other type of search integration site, such as SurfWax and Copernic Agent, is aimed at rigorous researchers. These sites return a large number of search results and also support logical operations on keywords, helping users mine the information for more in-depth research. This second type of site is quite specialized and generally requires payment, so it is not popular among ordinary users.
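The first approach, combining results from several engines and removing duplicates, can be sketched as follows. The engine names, URLs and the crude URL normalization (stripping a trailing slash) are assumptions for illustration only, and the topic clustering step is left out; this is not how Vivisimo, MetaCrawler or DogPile are actually implemented.

```python
def merge_results(results_by_engine):
    """Merge ranked result lists from several engines, removing duplicates.

    `results_by_engine` maps an engine name to its ordered list of
    (url, title) pairs. Pages returned by more engines come first;
    ties are broken by the best rank any engine gave the page.
    """
    merged = {}
    for engine, results in results_by_engine.items():
        for rank, (url, title) in enumerate(results, start=1):
            key = url.rstrip("/")  # crude normalization (assumption)
            entry = merged.setdefault(
                key,
                {"url": url, "title": title, "engines": [], "best_rank": rank},
            )
            entry["engines"].append(engine)
            entry["best_rank"] = min(entry["best_rank"], rank)
    return sorted(
        merged.values(),
        key=lambda e: (-len(e["engines"]), e["best_rank"], e["url"]),
    )


# Hypothetical results from two made-up engines.
results = {
    "engine_a": [("http://x.example/", "X"), ("http://y.example", "Y")],
    "engine_b": [("http://y.example/", "Y again"), ("http://z.example", "Z")],
}
for entry in merge_results(results):
    print(entry["url"], entry["engines"])
```

A page found by both engines is listed once and placed ahead of pages found by only one, which is one simple way to give users a single, better-organized list.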

Search engine facts and figures

● Among Internet applications, search ranks second only to e-mail;

● Users enter an average of 1.3 keywords per search;

● High-frequency words account for about 1/3 of all the words on web pages, yet they are of almost no use in actual searches;

● Fewer than 0.5% of users use the advanced features of search engines, and some of those users are librarians. They find information that readers cannot find themselves; the tool they use is still a search engine, only with its advanced features;

● In 2003, 17 million Chinese Internet users ran 11 billion searches on Baidu, of which nearly 700 million were related to Chinese stars.
