Project: HathiTrust (Shared Digital Repository)
Primary UITS contact: Brad Wheeler
Last update: April 10, 2009
Description: The HathiTrust leverages the tradition of leadership in collaboration among the institutions of the Committee on Institutional Cooperation (CIC). The HathiTrust operates under the leadership of the Repository Administrators (Indiana University and the University of Michigan), which also provide a large part of the funding. Additional governance and financial support are provided by the charter participating libraries of the CIC, and by other libraries and library consortia wishing to archive digital content.
Outcome: The HathiTrust offers persistent and high-availability storage for digitized book and journal content, beginning with the Google content from the CIC members and later extending to other digitized content. It will leverage technology investments and developments at the University of Michigan to build (through IU/UM collaboration) more generalized versions of Michigan's services and gain efficiencies from Michigan's investments.
HathiTrust governance: The executive management committee of the HathiTrust meets monthly and continues to work on a variety of issues ranging from HathiTrust finances to development priorities. The first meeting of the Operational Advisory Board took place in June. The agenda focused on a review of the CIC Steering Committee's Short- and Long-Term Functional Objectives and, where appropriate, status reports. It was agreed that some of these items would be best addressed by CIC collaborations, while others are the responsibility of the centrally funded effort. The CIC will soon convene a committee to help better define the objective to create a public interface for the HathiTrust.
We continue to have productive conversations with several other institutions about possible participation in the HathiTrust, and hope to provide information on our progress in this regard in future updates.
News:
General news
-
Datasets: Sample datasets containing the OCR of
volumes in HathiTrust are now available. These datasets are provided
in the same directory structure and format as they are stored in the
repository. They are intended to give researchers the opportunity to
develop routines that can be run later on larger portions of the
corpus. Interested parties should contact
hathitrust-datasets@umich.edu with a description of the research they intend to conduct. More information is available at HathiTrust Datasets.
-
Coordination between UM and UC staff:
Collaboration ramped up significantly between teams at the University
of Michigan and the University of California in March, in preparation
for ingest of content from the University of California. Weekly
conference calls sped the teams' progress in addressing a checklist
of ingest items including coordination of bibliographic information,
inclusion of coordinate data for OCR files, and reporting on ingested
volumes.
-
Ingest from Indiana University: Bibliographic
metadata from Indiana University has been received at the University
of Michigan, and is being loaded into local systems. Once the metadata
is loaded, ingest of content will begin.
- HathiTrust growth: Ingest rates decreased in March, with just under 130,000 volumes entering the repository. As in previous months, this decrease reflects the fact that ingest rates are matching the output of digital content from the University of Michigan and the University of Wisconsin. When ingest of content from the University of California and Indiana University begins (projected for April), ingest rates will rise closer to our planned capacity of 500,000 volumes per month.
Deployment status
- Establishing Indiana mirror site: Deployment of indexing and access systems on the Indiana University repository instance was completed in March. The repository is now a fully functioning mirror of the site at the University of Michigan with load balancing and fail-over.
Development update
-
Storage: The partners purchased additional
storage for the Michigan and Indiana sites in March. The new storage
will be installed in April and May, respectively, bringing both
environments to approximately 320TB of capacity.
-
Large-scale search: We are using the results of
large-scale search testing done so far to develop a hardware
configuration for production Solr infrastructure. Investigations
continue into software solutions for improving response times for slow
queries.
-
Data API: The first draft of a functional
specification for the HathiTrust Data API is complete and has been
made available publicly on HathiTrust Data
API for feedback. Work on the implementation of this specification
is underway and will continue in parallel as feedback is received.
- Public discovery interface: Initial development of the temporary beta catalog for HathiTrust is nearly complete, and the catalog will be released within the next several weeks. It will provide bibliographic search and faceted browse of all volumes in HathiTrust, integrating with the HathiTrust Page Turner to provide access to individual items. Integration with the Collection Builder application will be completed in a second phase of development.
Growth
- 129,819 new volumes were added in March 2009.
- As of April 1, 2009, the repository contained a total of 2,780,007
volumes.
- 30,758 public domain volumes were added in March, bringing the
total number of public domain volumes to 433,641 (15% of the total
content).
- Ingest of Wisconsin materials continued. As of April 1, 2009, HathiTrust contained 168,098 Wisconsin volumes.
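The growth figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only numbers stated in this report; the variable and function names are illustrative and not part of any HathiTrust system. It derives the totals at the start of March, the public domain share of the repository, and March ingest as a fraction of the planned capacity of 500,000 volumes per month.

```python
# Figures taken from this report (April 10, 2009 update).
total_volumes = 2_780_007             # repository total as of April 1, 2009
added_in_march = 129_819              # new volumes added in March 2009
public_domain_total = 433_641         # public domain volumes as of April 1, 2009
public_domain_added = 30_758          # public domain volumes added in March 2009
planned_capacity_per_month = 500_000  # planned ingest capacity stated in the report


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Totals at the start of March, obtained by subtracting March additions.
total_before_march = total_volumes - added_in_march
public_domain_before = public_domain_total - public_domain_added

# Share of the repository that is public domain, and March ingest
# relative to planned monthly capacity.
public_domain_share = percent(public_domain_total, total_volumes)
capacity_used = percent(added_in_march, planned_capacity_per_month)

print(f"Total volumes at start of March: {total_before_march:,}")
print(f"Public domain volumes at start of March: {public_domain_before:,}")
print(f"Public domain share: {public_domain_share:.1f}%")
print(f"March ingest as share of planned capacity: {capacity_used:.1f}%")
```

The computed public domain share comes out slightly above the 15% quoted in the list, which is consistent with the report truncating the figure rather than rounding it.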
Forecast for April development
- Continue to investigate ways to improve performance for slow
queries in large-scale search.
- Continue work on the HathiTrust Data API specification and gather
input from a broader audience.
- Continue coding the initial Data API implementation.
- Complete initial development of the temporary public beta catalog for HathiTrust.
Outages:
PLEASE NOTE: To receive information about unscheduled outages, contact Chris Butchart-Bailey (chrisbu at umich.edu) with the email addresses of individuals or groups that should be added to our system outage mailing list.
We schedule system maintenance work that requires a system outage during time windows (in Eastern time) where academic user activity is generally lowest:
- For major work, Friday evenings (8pm-1am) and Sunday mornings (5am-10am);
- For minor work, weekdays from 6:30am-8am.
Advance notice for scheduled outages is given on business days, at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.
- Outages in March: HathiTrust was unavailable on Tuesday, March 3 from 7-8am EST and on Thursday, March 5 from 7-7:45am EST for operating system and database software upgrades.
- Outages planned for April/May: No outages are planned at this time.
HathiTrust Short- and Long-Term functional objectives: April 10, 2009
Short-term functional objectives:
-
Page turner mechanism: A page turner has been
deployed for all content in HathiTrust. We hope to report soon on a
strategy to re-engineer the current page turner application so that it
provides access to materials in HathiTrust through an API. The
intention is to provide a wider variety of functions or modes of
access to the collections than are currently available. A first draft
of the functional specification for the data API was completed in
January. Following internal discussion and revision, it was released in
April on the HathiTrust web site for broader comment. It is now
available at HathiTrust Data
API. Feedback on the specification is requested and should be sent
to
hathitrust-info@umich.edu. (Note that this API is separate from the API for extracting metadata from HathiTrust described below.)
-
Branding (overall initiative; individual
libraries): After consultation with our partners, we released
several new elements that provide support for branding in the
HathiTrust repository. These elements include:
- The page turner now prominently identifies the HathiTrust
initiative.
- A watermark on every page identifies the digitizing
agent.
- A watermark on every page identifies the source library of the
print material.
- The source of the print material is included in our feed of
bibliographic identifiers so that institutions can import or update
records with this information.
- Finally, we will soon be adding an element identifying the relevant partner institution for patrons of that institution.
-
Format validation, migration, and error-checking:
Format validation and error-checking are currently performed for all
content that enters HathiTrust. Although, to date, no migration of
content has been necessary, we believe that we have mitigated this
need by choosing rich, flexible, standards-based formats. We have
performed the work required to store a variety of technical and
digital preservation metadata along with each object in order to aid
in migration should it become necessary. Finally, the Isilon storage
automatically conducts periodic parity and media checks in the
background, an uncommon feature in storage systems and one of the
reasons this storage system was seen as an appropriate match to the
project.
-
Development of APIs that will allow partner libraries to
access information and integrate it into local systems
individually: The HathiTrust partners identified the need for
a mechanism by which a bibliographic identifier (e.g., an ISBN or OCLC
number) can be submitted to a HathiTrust API and resolved as a
persistent URL with information about levels of access (e.g., full
text or search only). A preliminary version of such an API has been
released, and is being implemented in the online catalogs of several
partners. For more information, see HathiTrust Rights
API.
A second API, known as the HathiTrust Data API, is available to provide secure access to HathiTrust data and metadata resources. Making these resources available to client applications (examples of current applications are the HathiTrust Collection Builder and Page Turner) will enable the creation of additional services and uses of repository materials. The specifications for the Data API are available at HathiTrust Data API.
Other similar APIs will be developed as needed in the future.
-
Access mechanisms for persons with disabilities:
HathiTrust has deployed an interface for visually impaired users
(optimized for use with JAWS and other screen readers). This interface
presents to the user the entire text version, with navigation, on one
screen. Staff members at the University of Michigan are currently
working with UM School of Information interns to optimize this
interface for use with screen readers, as well as the general
accessibility of the Page Turner. For in-copyright resources, access is
currently limited to authorized users at the University of
Michigan. We plan to add Shibboleth support to the HathiTrust
repository so that resources such as access mechanisms for persons
with disabilities can tie into the authentication environments of our
partner institutions.
-
Public 'Discovery' Interface for HathiTrust:
HathiTrust has initiated a multi-stage strategy to create a "public
interface" mechanism, an interface with which digital books and
journals in the HathiTrust repository can be discovered and
accessed.
- The first phase of this effort is the creation of a temporary
public beta of a comprehensive bibliographic search, to be made
available in April 2009. In the temporary public beta, we will
provide bibliographic search and faceted browse of all content in
HathiTrust, with the ability to restrict to all public domain
resources or volumes digitized from a specific institution's
collection. This public beta will also serve as a real-world proof of
concept for the second phase.
- A second phase has begun and involves active planning discussions
with OCLC on the creation of a "catalog" for HathiTrust. Chaired by
Lee Konrad (Wisconsin) and John Butler (Minnesota), this group will
create specifications for adaptation of WorldCat Local (WCL) for
HathiTrust. The deployment of the HathiTrust WCL interface is
scheduled for early 2010, with work ongoing throughout 2009.
- Subsequently, we will work to integrate this bibliographic discovery mechanism with full text searching. As this work progresses, we will provide updates in this space.
-
Ability to publish virtual collections: Vast
bodies of digital content benefit from methods to gather together
subsets into "collections" that can be searched and
browsed. HathiTrust has created an early release of a Collection
Builder that permits individuals to create public (i.e., shared) and
private collections. We will turn our attention to creating mechanisms
by which persons such as bibliographers can create and share
collections with a more formal identity (imagine, for example, having full text
resources associated with classic bibliographies such as the Wing or
Pollard and Redgrave short title lists). We are now performing
intensive usability review on the Collection Builder. Although the
Collection Builder's authentication and authorization now rely on
the University of Michigan "friend account" guest login system (see How to Set Up a Friend
Account for Guest Access to U-M Computing Resources), we will
work to add Shibboleth support to the HathiTrust repository so that
resources like the Collection Builder can tie into different
authentication environments.
- Mechanism for direct ingest of non-Google content: We are polling partner institutions for candidate digital book and journal collections that might be used for the creation of an ingest mechanism for content not digitized by Google.
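The identifier lookup described under the API objective above (a bibliographic identifier such as an ISBN or OCLC number is submitted and resolved to a persistent URL with an access level) can be illustrated with a small sketch. This is not the HathiTrust API: the record table, the `resolve` function, the `Resolution` type, the placeholder identifiers, and the example.org URLs are all hypothetical and serve only to show the shape of the request and response.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resolution:
    """Result of a lookup: where the item lives and how it may be accessed."""
    persistent_url: str
    access_level: str  # "full text" or "search only", as named in the report


# Hypothetical in-memory records keyed by (identifier type, identifier value).
# Real identifiers and persistent URLs would come from the repository itself.
_RECORDS = {
    ("oclc", "0000001"): Resolution("https://example.org/item/vol-a", "full text"),
    ("isbn", "0000000000"): Resolution("https://example.org/item/vol-b", "search only"),
}


def resolve(id_type: str, value: str) -> Optional[Resolution]:
    """Resolve a bibliographic identifier, or return None if it is unknown."""
    key = (id_type.strip().lower(), value.strip())
    return _RECORDS.get(key)


if __name__ == "__main__":
    hit = resolve("OCLC", "0000001")
    if hit is not None:
        print(hit.persistent_url, "-", hit.access_level)
    print(resolve("isbn", "not-in-table"))
```

A local catalog could call such a lookup for each record it displays and show a link only when a result is returned, labeling it according to the access level.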
Long-term functional objectives:
-
Compliance with required elements in the Trustworthy
Repositories Audit and Certification (TRAC) criteria and
checklist: HathiTrust has addressed most of the minimum
required elements in the TRAC criteria and checklist. All of the
required elements will receive ongoing attention, with incomplete
items being assigned the highest priority. In addition, the Center for
Research Libraries and HathiTrust have made plans for an independent
assessment of the HathiTrust repository, based largely on the Trustworthy
Repositories Audit and Certification (TRAC) criteria. The assessment
will take place during the summer of 2009. More information is
available on the CRL web site at 2009
Certification and Assessment of Portico and HathiTrust.
-
Robust discovery mechanisms like full-text
cross-repository searching: In January and February, our
experiments in large-scale search shifted from exploring different
hardware configurations to load testing. In March, we began using the
results of large-scale search testing done so far to develop a
hardware configuration for production Solr
infrastructure. Investigations also continued into software solutions
for improving response times for slow queries. Summaries of monthly
progress in search benchmarking are available at Large-Scale
Search. We continue to work toward a goal of being able to specify
the hardware and software required to support full text searching
(with Solr) of all volumes projected to be in the repository.
-
Development of an open service definition to make it
possible for partner libraries to develop other secure access
mechanisms and discovery tools: We believe that the great
wealth of resources that HathiTrust now makes available can only be
effectively exploited through the creation of an open service
definition that makes it possible for others to create new tools and
approaches to access. As a first step, we intend to create a parallel
production system that does not compromise the content in the
repository, and gives developers access to the functions of the
HathiTrust repository system. We hope that the availability of this
development sandbox will make it possible for partner institutions to
collaborate in creating new services through, for example, new or
expanded APIs. The HathiTrust Data API is an example of this. A draft
functional specification of the Data API has been completed, and is
available now for public comment at HathiTrust Data
API (more information on the Data API is available above in the
Short-term Objectives). Future strategies may also include the
implementation of Fedora as part of the repository management
infrastructure. Updates on our progress will appear in this
report.
-
Support for formats beyond books and journals:
Our first "content" priority is support for digitized books and
journals, but we believe that HathiTrust must expand its support to
other formats (particularly born-digital publications) and
materials. This is an area of future work.
-
Development of data mining tools for HathiTrust and use by
HathiTrust of other analysis tools from other sources:
Because of the vast bodies of content held by HathiTrust, an important
function of the HathiTrust repository will be to support data mining
and other forms of large-scale analysis. As a first step toward this
goal, HathiTrust has made sample datasets of two different sizes
available to researchers for computational processing and
analysis. The first sample is available to all researchers through an
application process. The second sample will be available to
participants in the Digging
Into Data Challenge. The samples are described below:
-
Sample 1: The first sample is composed of 5,000
texts, which may be requested in one of three bundles. Texts in all
bundles are pre-1923 (pre-1869 for works published outside of the
United States) and are as follows:
- A random sample representing four character sets and five languages (Arabic, English, French, Japanese, and Russian)
- A random sample of English language literary and historical texts
- A random sample of Classics texts, including original language texts and translations.
- Sample 2 - Digging Into Data: A second sample of 50,000 texts will be made available for participants in the Digging into Data Challenge. The corpus represents a mix of dates (as above, all pre-1923, and pre-1869 for materials published outside the United States), countries of origin, languages, character sets, and formats (i.e., some serial literature in a body of mostly monographic literature). More information about these datasets, as well as specifications of file formats and modes of access, will be posted soon on HathiTrust.org.
More information is available at HathiTrust Datasets.

