Monday, November 15, 2010

Readings for 11-22 - 11-26

Web Search Engine: Parts I & II
These articles provided a nice summary of the basic functions and set-ups of various web search engines.  The first article discussed how web search engines index certain types of information.  I found it both fascinating and discouraging that millions of pieces of information are constantly being put on the web and indexed.  The GYM engines (Google, Yahoo, and Microsoft) are indexing information at a thousand times the rate at which they used to.  The discussion about crawling and crawlers was useful as well.  Crawlers save a lot of time because they eliminate duplicate resources, which is very much appreciated in a field that is always pressed for time.  Part II of the article discussed various algorithms and methods that are used for search queries.  The most important thing I gathered about queries was the "clever assignment of document numbers" section.  Instead of numbering documents arbitrarily, the indexer can number them to reflect their decreasing query score.  This achieves effective postings compression, skipping (jumping over portions of a postings list), and early termination.
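The idea behind that numbering trick can be sketched in a few lines of code. This is a hypothetical illustration, not the engines' actual implementation: documents get IDs in decreasing order of a static quality score, so a postings list sorted by ID is implicitly ranked, gaps between IDs stay small for compression, and a query can stop after the first k entries.

```python
# Hypothetical sketch of "clever document numbering": assign doc IDs in
# decreasing order of a static quality score, so postings lists are
# implicitly ranked, delta-compress well, and allow early termination.

def assign_doc_ids(docs_with_scores):
    """Number documents so that lower IDs mean higher static score."""
    ranked = sorted(docs_with_scores, key=lambda d: d[1], reverse=True)
    return {doc: doc_id for doc_id, (doc, _score) in enumerate(ranked)}

def delta_encode(postings):
    """Store gaps between sorted doc IDs; small gaps compress well."""
    gaps, prev = [], 0
    for doc_id in sorted(postings):
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def top_k(postings, k):
    """Early termination: the first k IDs are already the best-scored docs."""
    return sorted(postings)[:k]

ids = assign_doc_ids([("a.html", 0.9), ("b.html", 0.1), ("c.html", 0.5)])
# a.html gets ID 0, c.html gets ID 1, b.html gets ID 2
postings = [ids["a.html"], ids["b.html"]]
print(delta_encode(postings))  # [0, 2]
print(top_k(postings, 1))      # [0]
```

Because the best documents sit at the front of every postings list, the engine can return good results without scanning each list to the end.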

Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting
The Open Archives Initiative's basic mission is to "develop and promote interoperability standards that aim to facilitate the efficient dissemination of content."  However, this initiative has spread to a wide variety of other communities who were looking to provide access to information about their respective interests.  Three examples were used to show the diversity of this initiative.  The AmericanSouth.org project, the UIUC project, and the OAIster project were all involved with preserving and organizing important information about those particular organizations' interests.  Even though this initiative has given many benefits to various organizations, there are some challenges that must be tackled.  These include the varieties of metadata, the different formats of metadata, and communication problems within the initiative.  I would suggest that standards be mandated by these organizations to help counter these problems.  I would use the Dublin Core Metadata standards to correct the metadata problems, and I would suggest coming up with a definitive vocabulary for the initiative so confusion can be minimized for future providers and users.
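For context, the harvesting protocol the article discusses (OAI-PMH) works over plain HTTP GET requests with a "verb" parameter, and Dublin Core (oai_dc) is its required baseline metadata format. Below is a minimal sketch of building such a request; the repository URL is hypothetical, and a real harvester would fetch the URL and parse the returned XML.

```python
# Minimal sketch of an OAI-PMH harvest request. The base URL below is
# hypothetical; a real harvester would issue the request and parse the
# XML response, following resumption tokens for large result sets.
from urllib.parse import urlencode

def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optionally restrict to one set
    return f"{base_url}?{urlencode(params)}"

url = build_list_records_url("http://example.org/oai")
print(url)  # http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

The fact that every repository answers the same small set of verbs in the same Dublin Core baseline is exactly what lets services like OAIster aggregate records from so many different providers.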

The Deep Web: Surfacing Hidden Value
This was, by far, one of the most surprising articles I have read since coming to the MLIS program.  This article basically explained the main differences between surface web sites and deep web sites.  Some of the statistical information this author points out is mind-boggling.  For example, deep web documents are 27% smaller than surface web documents; deep web sites receive about half as much monthly traffic as surface sites; deep web sites are more highly linked to other sites than surface sites; 97.4% of deep web sites are publicly available (which was a surprise to me); and finally, the deep web is about 500 times larger than the surface web.  There is also a great diversity of topics covered in the deep web, from agriculture to humanities to shopping.  The thing that surprised me the most was that the deep web has an overall higher quality rating (satisfaction) than the surface web by 7.6%!  All of this information leads me to believe that there is a large amount of high-quality information in the deep web that is not being accessed.  Maybe information professionals should be pushing harder to "surface" some of this information.  Also, there should be a greater effort to educate the public about this topic.

2 comments:

  1. I was shocked by the higher quality/satisfaction number of the Deep Web materials also. I was also wondering why this information was lost to us. These materials do need to be pushed more to the surface so that users can use them, but I wonder how this would be done. Would information professionals just push certain, unknown search engines? Or put these Deep Web materials on the Surface Web? Another question to ask if that happens would be: if it is put on the Surface Web, does it still count as Deep Web material?

  2. I agree with you and Ms. Larkin--I was taken by surprise by those numbers as well for the Deep Web! I think I just assumed that the user satisfaction would be lower because of the labor-intensive method of having to do queries.
