Found 179,806 Resources

Archive-It 5.0

Smithsonian Institution Archives
Jennifer Marien, Intern, Digital Services Division

Archive-It 5.0 Changes and New Features

As a web preservation intern at the Smithsonian Institution Archives, I capture and preserve the Smithsonian’s web presence using the Archive-It crawling service. In October 2014, Archive-It released Phase 1 of Archive-It 5.0, which rolled out a new interface and more robust data collection for post-crawl reports. For now, the service lets users switch between versions 4.9 and 5.0. Archive-It offers ten new features for reports, including a quick text-box filter, infographics, the ability to add notes, and the option to compare two crawl reports side by side. The reports generated by web crawls play a large part in the Archives’ web collection packages and quality assurance (QA), so we need to understand the changes between versions 4.9 and 5.0 as we attempt to preserve the record of the ephemeral web.

Version 4.9 Crawl Report

Version 5.0 Crawl Report

Archive-It One-Time IDs

The snapshots above show the same crawl report, one in 4.9 and the other in 5.0. The new format and interface are not the only differences: the one-time IDs (identifiers) also differ. For this crawl, version 4.9 assigned 20150320165024358, and version 5.0 assigned 149112. While the Archives does not rely fully on these numbers as identifiers for crawls, they are attached to the file name when a summary/overview report and the WARC files are downloaded for our collections. Currently, the ability to switch back and forth between 4.9 and 5.0 makes this issue moot, but once that capability is removed, reports and WARC files downloaded with the 4.9 ID will be more difficult to locate and identify in the new Archive-It reports. Archive-It does not mention this change on the wikis it has provided regarding the roll-out of 5.0. The change could be problematic for organizations that use these IDs to identify crawls.
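One way to guard against losing track of files named with the old IDs is to record both identifiers side by side while switching between versions is still possible. The sketch below is only an illustration, not part of Archive-It or of our workflow: the crosswalk table, the function name, and the example file name are hypothetical, and only the two ID values come from the crawl described above.

```python
# Hypothetical crosswalk from 4.9 one-time IDs to 5.0 one-time IDs.
# Only this one pair of values comes from the crawl report discussed above.
CROSSWALK = {
    "20150320165024358": "149112",
}


def find_new_id(filename, crosswalk=CROSSWALK):
    """Return the 5.0 ID for a file whose name contains a known 4.9 ID, else None."""
    for old_id, new_id in crosswalk.items():
        if old_id in filename:
            return new_id
    return None


# The file name below is an assumed example, not an actual Archive-It download name.
print(find_new_id("summary-20150320165024358.csv"))  # prints 149112
```

A table like this would need to be filled in by hand, one crawl at a time, before the 4.9 view is retired.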

Report Summary Data

Part of building our web collection packages involves downloading the host data and the report summary from the post-crawl report. The host download provides the URLs that were archived from each host, as well as other information such as new data, documents blocked by robots.txt, and out-of-scope documents. When switching between 4.9 and 5.0, the only changes to the host data are the interface and the ability to browse hosts by seed for more robust data.

In 5.0, the report summary is now called an overview, but it contains the same type of data. However, I noticed a few discrepancies: the data is not consistent between the two versions. The snapshots of the same crawl above show different numbers for Total Documents Archived. Version 4.9 reports 12,440 documents archived, while version 5.0 reports 12,386. It is unclear why the figures differ between the two versions.
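To put the size of that gap in perspective, the short calculation below uses only the two totals quoted above; the variable names are my own, and it says nothing about the cause of the difference.

```python
# Total Documents Archived for the same crawl, as shown in each version's report.
docs_v49 = 12_440
docs_v50 = 12_386

difference = docs_v49 - docs_v50
percent = difference / docs_v49 * 100

print(f"{difference} documents ({percent:.2f}% of the 4.9 total)")
# prints: 54 documents (0.43% of the 4.9 total)
```

The gap is small relative to the crawl, but it is enough that a downloaded summary and a downloaded overview for the same crawl will not match.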

New Features Overall

The interface of the 5.0 reports page is an improvement. The one-time IDs are now visible; however, a collection name that is too long is cut off, requiring users to hover over the name to see it in its entirety. The quick text-box filter on the reports page is a helpful feature. The search function is more flexible than in 4.9, which only allowed searches by collection name or date.

The new view feature gives users a link from the reports page directly to the Wayback Machine, so they can view the URL without having to navigate to it through the access tab. This feature can help improve our quality assurance (QA) workflow. QA involves ensuring that our crawl and capture of the site accurately represent what the website displayed at the time of the crawl. Wayback allows us to view the crawl results visually, in website form, unlike the reports and hosts, which provide numerical data about the crawl.

Overall, the 5.0 features are an improvement on this service, which is an important tool for archiving the record of the Smithsonian today.

Earth Archive

Smithsonian American Art Museum

Donald Deskey Archive

Cooper Hewitt, Smithsonian Design Museum
Project files contain magazine and newspaper clippings, reviews, correspondence, renderings, floor plans, perspective drawings, site plans, sketches, preliminary drawings, patents, stationery, labels, and technical reports. There is an extensive collection of photographs and slides of many of Deskey's packaging designs, interiors, furnishings, and exhibition installations. The files of Donald Deskey Associates include organizational charts, client lists, proposals and financial records. Some of Deskey's personal correspondence, speeches, articles, and family photographs are included. Materials cover the period from 1927-1975.

Trude Guermonprez Archive

Cooper Hewitt, Smithsonian Design Museum
This archive includes documents related to Trude Guermonprez's life and work. The materials relate chiefly to the designer's work for her major clients, such as Holland America Line and Owens Corning Fiberglass; other pieces relate to the curtains she made for major synagogues and to her designs, interior fabrics, screens, and rugs realized in conjunction with J.P.Oud, Architects associated, New York, Eric Mendelsohn, Warren Callister, etc.

The correspondence and photographs in this collection offer insight into the designer's private life.

Henry Dreyfuss Archive

Cooper Hewitt, Smithsonian Design Museum
The work of Henry Dreyfuss, constantly focused on the needs of the average consumer, had a profound impact on the daily lives of millions of Americans. His firm had hundreds of clients (American Airlines, AT&T, Deere & Company, Hallmark Cards Incorporated, Polaroid Corp.) and worked on thousands of items. The material included in this archive does not cover all clients and projects undertaken by Dreyfuss. This collection consists of theater design materials; industrial design materials, primarily, though not exclusively, from the 1950s and 60s; draft copies of his books, including extensive research files for the "Symbol Sourcebook"; texts of lectures delivered by Dreyfuss; and biographical material. Included is Dreyfuss's Brown Book, which provides an outline of his achievements. Photographs and slides of many of his designs are included. Materials relating to three publications include original drafts of the books with author notes, drawings, photographs, correspondence, and research materials. The collection also contains materials relating to the symbols exhibition held at the Hallmark Gallery in New York City in 1972.

311 reels of microfilm documenting most of the projects undertaken by Dreyfuss Associates were created by the firm and added to the collection later.

Materials are arranged into four record groups:

1) Biographical information;

2) Theater design;

3) Industrial design;

4) Publications.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - Unknown

NMNH - Entomology Dept.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - Unknown

NMNH - Entomology Dept.

Illustration Archive - SI

NMNH - Entomology Dept.

Illustration Archive - Unknown

NMNH - Entomology Dept.

Illustration Archive - Cornell

NMNH - Entomology Dept.

Illustration Archive - SI

NMNH - Entomology Dept.

Illustration Archive - SI

NMNH - Entomology Dept.

Illustration Archive - SI

NMNH - Entomology Dept.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - SEL

NMNH - Entomology Dept.

Illustration Archive - Unknown

NMNH - Entomology Dept.