Web archiving at the Borthwick

The Borthwick Institute for Archives uses the Archive-It service to capture copies of websites selected for inclusion in its web archive. Archive-It is a subscription-based web archiving service established by the Internet Archive and is used by organisations throughout the UK, Europe and North America for harvesting and managing web content. Archive-It is currently used by over 500 partner organisations, including university libraries and archives, government libraries and archives, museum and art libraries, historical societies and public libraries.

The Archive-It service captures copies of websites using versions of the open source crawling software programs Heritrix and Brozzler.

For more information, please see our Web Archiving policy.

Website owner FAQs:

Are you going to crawl my website as part of your web archive program?

We will always seek your consent and written permission before crawling your website as part of our web archive program. At this time, we will also discuss what web archiving involves, and share information with you about how frequently your site will be captured and how archived copies will be accessed.

University of York websites are captured as part of our University Archive.

How often will you crawl my website?

The frequency with which we crawl your website will depend on collection guidelines and the nature of the individual website. Websites that are actively maintained and updated will likely be captured and recaptured at regularly scheduled intervals, such as semi-annually or quarterly.

In some cases, a website may only be crawled once or for a specified time period. Occasionally, we may choose to discontinue regularly scheduled crawls. These cases might occur if:

The value of the site is limited to a specific time period (such as a conference, one-time event, or the work of a temporary committee);
The site has reached its end of life and will be taken down after harvesting;
The targeted website has exhibited little or no change for three consecutive years (as verified by Archive-It curatorial tools and manual review by staff).

Will crawling software impact the performance of my website?

We crawl websites at a rate designed not to interfere with performance. For actively updated websites, crawls will generally be run quarterly or semi-annually, and last for a few days. Very large websites may need to be crawled over a period of weeks. Once a crawl is complete, the crawler no longer interacts with your server. It is unlikely that Archive-It’s crawlers will impact the performance of your website, but if you encounter any issues or have any additional questions, please contact us at borthwick-institute@york.ac.uk.

My site has password-protected areas that require a user to log in. Will this protected content be archived?

No. The Borthwick does not archive password-protected content. If you believe that your site’s password-protected content should be included as part of our web archive, please contact us.

I have a robots.txt exclusion that blocks crawlers from accessing certain parts of my site. How does this affect your collecting activity?

In order to create an accurate snapshot of your website for future researchers, we aim to capture as much of your site as possible. We will not bypass robots.txt exclusions by default, but may contact you to request changes to rules or seek permission to override them where such exclusions prevent the capture of content.
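Site owners who would like to see which of their pages a robots.txt file currently blocks can check this themselves. The following is a minimal sketch using Python's standard library. The robots.txt rules, the example.org URLs and the "example-crawler" user agent name are all hypothetical placeholders rather than the values Archive-It uses, so please substitute your own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def blocked_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that the given robots.txt rules disallow for user_agent."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Placeholder URLs on a hypothetical site.
    candidate_urls = [
        "https://example.org/",
        "https://example.org/private/report.pdf",
        "https://example.org/search?q=minutes",
    ]
    for url in blocked_urls(EXAMPLE_ROBOTS_TXT, "example-crawler", candidate_urls):
        print("Blocked by robots.txt:", url)
```

Any URLs this reports as blocked are the kind of content we may ask you about when discussing changes to your rules.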

Will you take over the hosting of my site?

No. By archiving your site, the Borthwick is preserving a static snapshot of your site at a particular time. The hosting, management and maintenance of your live website remains your responsibility.

Are you able to capture media, audio and video files?

Yes, downloadable media, audio and video files can usually be captured, although content hosted on third-party services like YouTube, Vimeo or SoundCloud can sometimes be challenging to capture. For this reason, we may ask you to provide separate copies of media, audio and video files where possible, which we will store outside of the web archive and which may provide a more robust solution for long-term access.

Archived copies of sites cannot render files that are not directly linked and must instead be retrieved from a database via a user query (for example, when a user must run a search in order to retrieve and access a file).

How will archived copies of my site be accessed?

The Borthwick makes sites archived as part of its web archiving program freely available to the public via our Archive-It partner page, where website-level metadata is supplied to allow for browsing and full-text search. A prominent banner informs users that they are viewing an archived web page, as opposed to the live site. Relevant links to archived copies of sites will also be included in collection guides and/or catalogue records.

By default, archived sites undergo a six-month embargo period before being made available. Where appropriate and in discussion with site owners, selected content may be subject to longer embargo periods.

Are there things that I can do to make my website easier for crawlers to capture?

If your website includes content that is challenging for Archive-It’s crawlers, Borthwick staff will work with you to find alternative solutions for capturing that content. This may mean storing copies of challenging files separately from the web archive and providing access via alternative means.

However, owners wishing to optimise their site design to support more complete archiving may be interested in Columbia University Libraries’ Guidelines for Preservable Websites.

User FAQs:

For what purposes may I use the Borthwick Web Archive?

Users may access the collection for non-commercial research purposes and private study. Content within the collection may be subject to intellectual property rights governed by local, national and/or international laws or regulations. In using this content, you are responsible for abiding by all applicable laws in connection with your research.

Users are also responsible for ensuring that they use archived web resources in an ethical manner. Ethical use includes ensuring that archived web content is not represented as the live or most current version of a site, as well as accurately citing any archived web content used in research.

In addition, users are responsible for any personal data concerning living individuals obtained from archived web resources. When you capture or take away personal information you become the data controller of this information, and are liable for it and any subsequent use made of it. It is your responsibility under Data Protection Law to ensure that your use of any personal data accessed via the Borthwick Web Archive is fair and lawful.

Why do the archived versions of some websites appear to be incomplete?

There are a number of reasons why the archived copy of a website may appear to be incomplete:

Some types of content are challenging or impossible to capture and/or reproduce, including JavaScript-driven navigation menus, streaming audio and video, and dynamic form-driven and database-driven content.
We only capture publicly accessible content, so restricted or password-protected site pages will not be collected.
Typically, the scope of our crawls does not include links to content hosted on other websites. If you find that a link on an archived page no longer works, double-check the URL: you will likely find that it belongs to a different site.
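For users who want to check links in bulk, the comparison described in the last point above can be sketched in Python. This is only an illustration: the function name and the example.org and example.com addresses are hypothetical, and it assumes you are comparing the original (live) URLs rather than the addresses shown when viewing an archived page.

```python
from urllib.parse import urlsplit


def is_external_link(site_url: str, link_url: str) -> bool:
    """Return True if link_url points to a different host than site_url."""

    def host(url: str) -> str:
        # Lower-case the host and ignore a leading "www." (requires Python 3.9+).
        return (urlsplit(url).hostname or "").lower().removeprefix("www.")

    link_host = host(link_url)
    # Relative links such as "/contact" have no host and stay on the same site.
    if not link_host:
        return False
    return link_host != host(site_url)


if __name__ == "__main__":
    site = "https://www.example.org/about"
    for link in ("https://example.org/news", "https://video.example.com/clip", "/contact"):
        label = "other site" if is_external_link(site, link) else "same site"
        print(f"{link}: {label}")
```

Links reported as belonging to another site are the ones most likely to fall outside the scope of a crawl.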

Notice and takedown:

The Borthwick Institute for Archives wishes to ensure that content made available as part of its web archive is lawful. If you object to material included as part of the Borthwick’s web archive, please fill out our Notice and Takedown form. We may disable access to the material in question while we assess your objection. If the material in question was supplied to us by a third party, we may need to contact them as part of our enquiries.