A Free Sample Essay On Archiving Websites


Archive.org is the single largest collection of archived websites, boasting “452 billion web pages saved over time” (“Internet Archive,” n.d.). While these collections certainly have historical significance, they also raise questions of ethics, legality, exploitation, and, more generally, the right to be forgotten. Further, it must be shown that this public good neither causes nor promotes harm to website owners or to the people who are the subject of archived articles, posts, and similar content. The internet is a continually changing landscape in which websites, social media postings, and defamatory statements may “live” in their original state for only short periods. Because of this, Archive.org attempts to catalogue them quickly, before they are forgotten and can no longer be accurately referenced by future researchers. For researchers, this is vital as they attempt to tie together several sources from particular time periods (“Internet Archive,” n.d.). There have even been court cases in which Archive.org and its website-archiving service, the Wayback Machine, have been used for evidentiary purposes. Archive.org and the Wayback Machine have also been accused of copyright infringement on multiple occasions, which led to the removal of several sites and their corresponding data.

Further, hackers have used the site to create permanent records of their accomplishments. Archive.org will often not honour removal requests from users absent a “clearly articulated legal and/or privacy concern” (“Internet Archive,” n.d.). Until recently, website owners retained the power to choose what material was publicly available through Archive.org. They did so by blocking Archive.org from crawling their website(s). Putting this block in place would also remove data that had previously been archived. In a 2017 announcement, Archive.org stated that it would no longer honour these blocks and that removal requests would need to be submitted individually (“Internet Archive,” n.d.). The value of Archive.org lies in the continual growth of its collections. At the same time, website owners need to retain a certain level of control over data captures from the moment of publication. The question from a research perspective is the following: Does the historical preservation of website data, as a public good, outweigh its negative effects?

Background information

With the increasing need to preserve website information, many internet-based data archiving organizations have been created. This research seeks to understand one of these organizations, its contribution to data preservation, and one of its main archiving components. Based on the evidence gathered, it will establish a clearly defined need to continue collecting internet sites through Archive.org. It will also demonstrate the benefits to researchers of the Wayback Machine’s archiving capabilities and show how these benefits outweigh any negative or questionable uses of the tool.

Archive.org is a nonprofit digital library based in San Francisco. Its mission is to become an Internet library that reliably stores large volumes of data and delivers that data to users via the internet (Jaffe & Kirkpatrick, 2009). To achieve this mission, the organization offers all members of the public free access to websites, games, music, images, videos, and other digitized material. Besides being an archival organization, Archive.org also advocates free and open access to the internet for all people without fear of censorship. The primary benefit of the Wayback Machine component is that websites remain accessible long after they have been taken down or have changed ownership. More specifically, website posts that would otherwise no longer be available remain accessible to researchers (Falagas, Karveli, & Tritsaroli, 2008). This availability is especially critical for URL references, since web links are very prevalent in journal articles and other scholarly publications.

However, the uncertainty and eventual decay of the associated URLs are problematic for researchers. Sadat-Moosavi et al. studied the accessibility and decay of online resources cited in four scholarly journals in library and information science (LIS) indexed by the ISI. They found that 1028 of a total of 2886 URLs (approximately 36%) were inaccessible in their first search on the internet (Jaffe & Kirkpatrick, 2009). Using the Wayback Machine as well as Google raised the URL access rate from sixty-four percent to ninety-five percent, as inaccessibility declined from thirty-six to five percent. Therefore, a vital question about the importance of the Wayback Machine is not just whether research is improved, but whether certain research would be nearly impossible to conduct accurately without this archived data.
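The arithmetic behind these figures can be checked with a short sketch. The function name and the returned keys are illustrative only. The ninety-five percent figure is taken from the text as a rounded value, so the recovered and still-missing counts are estimates rather than numbers reported by the study.

```python
def decay_summary(total_urls, missing_urls, recovered_access_rate):
    """Summarise URL decay before and after consulting web archives.

    total_urls and missing_urls come from the study figures quoted above;
    recovered_access_rate is the rounded rate quoted in the text, so the
    derived counts are approximations.
    """
    initial_missing_rate = missing_urls / total_urls
    initial_access_rate = 1 - initial_missing_rate
    # Estimated number of URLs still unreachable after using the archives.
    still_missing = round(total_urls * (1 - recovered_access_rate))
    # Estimated number of dead URLs that the archives brought back.
    recovered = missing_urls - still_missing
    return {
        "initial_missing_rate": initial_missing_rate,
        "initial_access_rate": initial_access_rate,
        "still_missing_estimate": still_missing,
        "recovered_estimate": recovered,
    }


summary = decay_summary(total_urls=2886, missing_urls=1028,
                        recovered_access_rate=0.95)
for key, value in summary.items():
    print(key, round(value, 3))
```

Under these assumptions, roughly six out of every seven initially dead links would have been recovered, which is the sense in which the archive changes what research is feasible.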

Literature review

Beyond research, Archive.org has been used in legal proceedings to help prove guilt or innocence through historical evidence. In conducting this research, we will look at Marten Transport v. PlatForm Advertising, a case still in progress in the District of Kansas. The plaintiff, a trucking company, brought a trademark infringement suit against the defendant, a website that posts job opportunities for truck drivers, claiming that the defendant had used the plaintiff’s trademark illegally on its webpage (“United States v. Gasperini, 17-2479,” n.d.). As proof, the plaintiff produced screenshots of the defendant’s webpage, obtained from the Wayback Machine, showing its use of the trademark, together with a validating statement from an Internet Archive employee.

The court found that the matter at issue was the content of the defendant’s own website, yet the defendant had offered no explanation of why the Wayback Machine’s archive of that website was unreliable. Under Federal Rule of Evidence 201, the historical content of the defendant’s website can be accurately and readily determined from sources whose accuracy cannot reasonably be questioned (“Rule 201,” n.d.). The research will explore further potential legal benefits, with the understanding that most users visit web archives because they cannot find the pages they need on the live web. This includes a case from the 2nd Circuit Court of Appeals. The case involved a computer hacker from Italy who sought to exclude screenshots of his website, which was accused of spreading a virus and operating a botnet, conduct for which he was eventually convicted. The prosecutors had obtained the screenshots from the Internet Archive to use as evidence, which the court held admissible.

The research will explore the application of archived websites in validity testing, an area in which Archive.org has already been applied. According to Sekaran (2003), predictive validity can be used to show the value of research carried out using Archive.org. For instance, a business can forecast its online sales from the number of people who visited its website and the inquiries made via email (Sekaran, 2003). To establish validity, the company has to periodically correlate website visits within a given month with sales for that month or subsequent months. Recurring correlations suggest predictive validity, enabling the company to use website visits in projecting future sales. Depending on the objective, the researcher normally uses correlation analysis or regression analysis to test such hypothesized relationships (Sekaran, 2003). A combination of convergent, predictive, and nomological validity helps to achieve construct validity. Archive.org and the Wayback Machine are used for tracking a site’s evolution, since the user can view earlier versions of a site along with the dates and content of its updates. For instance, a researcher can examine the evolution of a business’s online customer relationship programs by analysing a sequence of archived versions of the company’s site. In practice, researchers use the Wayback Machine to track and measure changes in web content.
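The correlation step described above can be illustrated with a minimal sketch. The monthly figures below are invented purely for illustration and do not come from any cited study, and `pearson_r` is a hypothetical helper written out by hand so that the calculation is visible. A coefficient close to 1 across repeated periods would be consistent with predictive validity, although it does not establish it on its own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: website visits and online sales for six months.
monthly_visits = [1200, 1500, 1700, 2100, 2600, 3000]
monthly_sales = [80, 95, 110, 140, 160, 190]

r = pearson_r(monthly_visits, monthly_sales)
print(f"correlation between visits and sales: {r:.3f}")
```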

The Wayback Machine has also gained legal acceptance in intellectual property and trademark matters. In a 2004 landmark case in the United States, the court ruled that pages obtained from the Wayback Machine were admissible as evidence. Even though they are admissible, use of the Wayback Machine has limits (Picolli et al., 2004). The Wayback Machine archives simple, publicly accessible HTML sites but has difficulty archiving password-protected sites. Moreover, sites may decline inclusion by sending an email to the Internet Archive or by using the Robots Exclusion Standard to specify files or directories that crawlers should avoid (Picolli et al., 2004). In practice, intellectual property owners worried about infringement on third-party sites can request the removal of such content.
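The Robots Exclusion Standard mentioned above can be sketched with Python’s standard `urllib.robotparser` module. The robots.txt content and the example.com URLs are hypothetical. The `ia_archiver` user-agent token is an assumption here: it is the name historically associated with the Alexa crawler that supplied the Internet Archive, and a given crawler may use a different name or, as noted earlier, may no longer honour the file at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out the archive crawler entirely, and keep
# every other crawler out of one private directory only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


archive_ok = is_allowed(ROBOTS_TXT, "ia_archiver", "https://example.com/index.html")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/index.html")
other_private = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/private/a.html")

print("archive crawler may fetch home page:", archive_ok)
print("other crawler may fetch home page:", other_ok)
print("other crawler may fetch private page:", other_private)
```

The file expresses a request rather than an enforcement mechanism: compliance depends entirely on the crawler choosing to read and respect it.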

Any action of this kind ends future indexing, removes the site’s content from the archive, and limits the completeness of the archive. Finally, because it relies on the Alexa web crawler, the Internet Archive waits approximately six months after a crawl before adding site updates to the archive (Straub, 2004). Together with the time needed to survey the 55 billion archived pages, this leads to an interval of about twelve months before a given snapshot becomes available (Straub, 2004). Examining the evolution of websites through web archiving helps researchers investigate the factors that lead to successful website implementation, including exactly which features organizations include in or exclude from their webpages. Studies limited to a single point in time cannot capture such evolution. Although longitudinal studies allow researchers to track changing relationships, carrying out repeated evaluations is difficult and tedious. Moreover, some websites may no longer exist, while some changes are only temporary. For example, research involving more than 1000 websites from six different groups found that only a few of the sites were still operating at the same URLs five years later (Straub, 2004).

Relying on stated behaviour instead of measuring actual behaviour is a limiting factor in diffusion studies. For instance, measuring a website’s age may involve researchers emailing webmasters to ask when their website first went online. However, a webmaster may not reply, may not know the answer, or may give wrong information (Howell, 2006). The age of the domain name, based on when an organization first registered it, gives a concrete measure of internet adoption. However, domain name age has limits as a measure of website evolution. Most names are registered within a commercial domain such as .com, and a change of domain name invalidates the recorded age. Also, an organization might buy a domain name and then wait months before hosting a website under that name, which makes the registration date an unreliable measure of when a website went online. Using Wayback Machine data, which archives the actual webpages, helps to overcome these limitations and to establish the date a site actually went live. Malaysian hotels were chosen for this research for three reasons. First, among industries going online, travel leads other services in e-commerce. Second, hospitality e-commerce research draws upon innovation diffusion, and following these research trends enables nomological validity testing (Howell, 2006). Third, most internet studies of tourism centre on developed countries, whereas developing countries remain under-researched. This may be because internet usage is still at an early stage in developing states like Malaysia. Research on the Malaysian hospitality industry therefore contributes information in these areas while also meeting the research objectives.
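The contrast drawn above between a domain’s registration date and the date of its first archived capture can be made concrete with a small sketch. The capture timestamps and the registration date below are invented for illustration. The only assumption carried over from the Wayback Machine itself is its 14-digit capture timestamp format, YYYYMMDDhhmmss, and both helper functions are hypothetical.

```python
from datetime import datetime


def first_capture(timestamps):
    """Return the earliest capture time from 14-digit archive timestamps."""
    parsed = [datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps]
    return min(parsed)


def months_between(earlier, later):
    """Return the number of calendar-month boundaries between two dates."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


# Hypothetical data for one hotel website.
domain_registered = datetime(2001, 3, 15)
capture_timestamps = ["20020718093012", "20011105120000", "20030101000000"]

first_seen = first_capture(capture_timestamps)
lag = months_between(domain_registered, first_seen)

print("first archived capture:", first_seen.date())
print("months between registration and first capture:", lag)
```

The first capture is only an upper bound on when the site went online, since a site can exist for some time before a crawler reaches it, but it reflects observed rather than stated behaviour.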

Assumption and research questions

While the benefits of archiving internet sites may seem clear to those in the research community, many webmasters fear they will no longer have control over the life of the information created through their sites. Therefore, we will analyze the following question: Do members of the general public have a right of ownership over all items they post online that authorizes them to grant and terminate access to this data? It is worth noting that the Wayback Machine is generally regarded as exempt from copyright issues under the fair use doctrine and because of the collection’s educational purpose. Further, webmasters could previously add a “Disallow” rule to their robots.txt file, which would prevent crawlers from automatically cataloguing their site(s). Once the rule was added, captures previously displayed on the archival site would be removed. The research will show why this should not have been allowed and why it ran counter to the original intent of the rule, which was to exclude sites from search engines. Additionally, there are ethical questions, including the “right to be forgotten.” Should someone post something unfavourable on a popular website, it may be deemed offensive and taken down accordingly. However, the Wayback Machine has an obligation to preserve this data as well for continued research. To analyze this further, a focus group of college instructors and students will be formed to gauge their opinions on the levels of harm that may result from this data preservation.

To guide and shape the research, the following research questions will be essential: To what extent can an archive be democratic? Can a curatorial process be fully democratic? And how can such an archive be established?


In essence, these articles show that there is a growing and substantial use of the Wayback Machine in various fields, including social science research and the location of important documents and their first appearance. In some instances, the Wayback Machine is used as a data source for quantitative analysis. This research fits into the latter group. These articles will help the research achieve its objectives, and the research in turn will contribute to existing studies by identifying and describing the process of retrieving and analysing website information from the Wayback Machine for quantitative social science study.