Here are a few more good links that give you an idea of how URLs are extracted from an HTML page, the algorithms for replacing old, unused URLs in a cache, and the storage format of URLs in the cache.
http://www.mnot.net/cache_docs/#WORK
http://www.seoconsultants.com/articles/1000/cache-control.asp
http://www8.org/w8-papers/2a-webserver/caching/paper2.html
http://www.eecs.harvard.edu/~vino/web/usenix.196/
http://download-east.oracle.com/docs/cd/B15904_01/caching.1012/b14046/cache.htm
http://feedparser.org/docs/http-etag.html
http://searchoracle.techtarget.com/generic/0,295582,sid41_gci1049915,00.html
http://www.gii.upv.es/web_architecture/download/paper-20061031125802-apont.pdf
http://www8.org/w8-papers/2a-webserver/caching/paper2.html#wcaching
http://webjunction.org/do/DisplayContent?id=933
http://www.panicware.com/resource_cookies.html
http://www.cisco.com/web/about/ac123/ac147/ac174/ac199/about_cisco_ipj_archive_article09186a00800c8903.html
Each web page within a website is an HTML file with its own URL. Once the pages are
created, they are typically linked together through a navigation menu composed of hyperlinks.
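Since pages are tied together by hyperlinks, pulling the URLs out of a page mostly comes down to collecting the href of every anchor tag. Here is a minimal sketch of that idea using only the Python standard library. The names LinkExtractor and extract_links are made up for this example, and it only looks at <a> tags, so URLs in images, scripts, or frames are ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are considered in this sketch.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all hyperlinks found in html."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding it the HTML of a navigation menu together with the page's URL should give back one absolute URL per menu entry, which can then be queued for fetching or stored in a cache index.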
http://www2003.org/cdrom/papers/refereed/p096/p96-broder.html#Knut73
http://www.clevercomponents.com/articles/article010/urlextractor.asp
http://www.nirsoft.net/utils/addrview.html
The HTTP GET message is used to retrieve a document given its URL. It is clear from the HTTP
specification that when a GET passes through a cache, the cache may choose to return a cached
document; a GET alone does not guarantee a fresh page.
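To make the point about GET and caches concrete, here is a rough simulation of a cache sitting in front of an origin server. Everything in it is an assumption for illustration: TinyCache, the fetch_origin callback, and the fixed max_age are invented names, and the handling of the Cache-Control: no-cache request header is simplified to a plain refetch, whereas a real cache would revalidate with the origin (for example with an ETag).

```python
import time


class TinyCache:
    """Hypothetical in-memory cache in front of an origin fetch function."""

    def __init__(self, fetch_origin, max_age=60.0):
        self.fetch_origin = fetch_origin  # callable: url -> body
        self.max_age = max_age            # seconds a stored copy is served
        self.store = {}                   # url -> (stored_at, body)

    def get(self, url, headers=None, now=None):
        headers = headers or {}
        now = time.time() if now is None else now

        # Header names are case-insensitive, so normalise before checking.
        cache_control = ""
        for name, value in headers.items():
            if name.lower() == "cache-control":
                cache_control = value.lower()
        bypass = "no-cache" in cache_control

        entry = self.store.get(url)
        if entry is not None and not bypass:
            stored_at, body = entry
            if now - stored_at < self.max_age:
                # A plain GET is answered from the cache, possibly stale
                # relative to what the origin currently holds.
                return body, "HIT"

        # Simplification: go back to the origin and replace the stored copy.
        body = self.fetch_origin(url)
        self.store[url] = (now, body)
        return body, "MISS"
```

With this sketch, a second plain GET inside the max_age window gets the stored copy even if the origin has changed in the meantime. Sending Cache-Control: no-cache with the request is what makes the cache go back to the origin.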
http://iw3c2.cs.ust.hk/WWW5/www5conf.inria.fr/fich_html/papers/P2/Overview.html