The Scholarly Data Archive (SDA)
Note: Due to a possible data corruption issue with HTAR versions 4.0 and greater, you should update the HSI/HTAR client on your personal workstation as soon as possible to the patched version made available March 28, 2013, by the UITS Research Storage team. To download the patched version and get further information about this issue, see the Research Storage HSI page. If you have questions, email Research Storage.
On this page:
- System overview
- System information
- Working with electronic protected health information
- System access
- Transferring files
- Acknowledging grant support
System overview

The Indiana University Scholarly Data Archive (SDA) provides extensive capacity (42 PB) for storing and accessing research data. The SDA is a distributed storage service co-located at IU data centers in Bloomington and Indianapolis. The SDA provides IU researchers with large-scale archival or near-line data storage, arranged in large files, with automatic off-site copies of data for disaster recovery.
Access is available to IU graduate students, faculty, and staff. Undergraduates and non-IU collaborators must have IU faculty sponsors. For details, see the "Research system accounts (all campuses)" section of Account availability and eligibility.
The SDA supports high-performance access methods, such as parallel FTP (PFTP) and Hierarchical Storage Interface (HSI); an HPSS API is available for programmers, as well.
The SDA uses the consortium-developed High Performance Storage System (HPSS), a hierarchical storage management (HSM) software package that presents a hierarchy of storage media to users as a single, transparent store of massive capacity. This hierarchy comprises disk caches totaling roughly 600 TB, backed by two high-end tape libraries that provide a total uncompressed data storage capacity of nearly 15 PB. This near-line, tape-based storage system, mediated by fast, efficient disk caches, gives users the appearance of massive disk capacity at a fraction (usually about a hundredth) of the cost of storing the same data on spinning disk.
Note: At IU, the initials SDA and HPSS are often used interchangeably to describe the same service.
Although the names of files placed on the SDA remain visible to the user, the actual data migrate to tape when they have not been accessed for a certain period of time. When data have migrated to tape, retrieval can take up to two minutes per file, because the tape robot must locate, mount, and read the appropriate tape. Because of this overhead, the SDA is not well suited to storing a large number of small files.
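Because each tape retrieval carries per-file overhead, a common workaround is to bundle many small files into a single archive before storing it. Below is a minimal sketch using standard tar (all file and directory names are illustrative; on the SDA itself, the HTAR client performs this kind of bundling natively):

```shell
# Create some example small files (illustrative only)
mkdir -p results
printf 'run 1 output' > results/run1.dat
printf 'run 2 output' > results/run2.dat

# Bundle them into one archive; this single file migrates to tape
# as one object instead of many
tar -czf results.tar.gz results

# Verify that the archive contains both files
tar -tzf results.tar.gz
```

The resulting single archive can then be transferred to the SDA with HSI, SFTP, or any other supported method, and later retrieved with a single tape mount.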
The SDA is the first HPSS system in the world to offer the hpssfs interface in production. The SDA is also the first HPSS system in the world to implement a remote data mover (at IUPUI). IU's remote data mover has demonstrated the feasibility of a widely distributed (i.e., across a wide area network, or WAN) HPSS, in which data stored and accessed by users at IUPUI are served locally by the IUPUI data mover at high, local area network (LAN) speeds. A small stream of metadata (administrative data about the stored files) flows on the WAN segment between IUB and IUPUI; this is necessary because the metadata engine is located at IUB. Such a widely, geographically distributed storage system design is highly cost effective.
Since the installation of the I-Light high-performance network between IUB and IUPUI in 2001, the SDA HPSS system has been able to create two tape copies of user data (one at IUB and another at IUPUI), adding a degree of disaster tolerance to both sites.
Note: The SDA is offline for regularly scheduled maintenance every Sunday from 7am to 10am.
System information

| Machine type | Distributed HPSS data archive |
| Operating system | Red Hat Enterprise Linux 5 |
| Network file system protocols | HSI/HTAR, CIFS (Samba), SFTP/SCP, HTTPS |
| Total tape capacity | 15 PB |
| Total disk capacity (cache) | 600 TB |
| Quotas | 50 TB (default) per user; 50 TB (default) per project; increases as needed |
| Backup and purge policies | Dual copies of data, but no backups; system is never purged |
| Aggregate I/O | 80 Gbps |
Working with electronic protected health information
Although this and other UITS systems and services have been approved by the IU Office of the Vice President and General Counsel (OVPGC) as appropriate for storing electronic protected health information (ePHI) regulated by the Health Insurance Portability and Accountability Act of 1996 (HIPAA), if you use this or any other IU IT resource for work involving ePHI research data:
- You and/or the project's principal investigator (PI) are responsible for ensuring the privacy and security of that data, and for complying with applicable federal and state laws/regulations and institutional policies. IU's policies regarding HIPAA compliance require the appropriate Institutional Review Board (IRB) approvals and a data management plan.
- You and/or the project's PI are responsible for implementing HIPAA-required administrative, physical, and technical safeguards for any person, process, application, or service used to collect, process, manage, analyze, or store ePHI data.
Important: Although UITS HIPAA-aligned resources are managed using standards meeting or exceeding those established for managing institutional data at IU, and are approved by the IU Office of the Vice President and General Counsel (OVPGC) for storing research-related ePHI, they are not recognized by the IU Committee of Data Stewards as appropriate for storing other types of institutional data classified as "Critical" that are not ePHI research data. To determine which services are appropriate for storing sensitive institutional data, including ePHI research data, see Comparing supported data classifications, features, costs, and other specifications of file storage solutions and services with storage components available at IU.
For more, see:
- What are my responsibilities when using UITS systems for work with electronic protected health information?
- At IU, what types of sensitive data are appropriate for the research computing systems?
The UITS Advanced Biomedical IT Core (ABITC) provides consulting and online help for IU researchers who need help securely processing, storing, and sharing ePHI research data. If you need help or have questions about managing HIPAA-regulated data at IU, contact Anurag Shankar at ABITC. For additional details about HIPAA compliance at IU, see HIPAA & ABITC and the Office of the Vice President and General Counsel (OVPGC) HIPAA Privacy & Security page.
System access

- For instructions on requesting an individual SDA or RFS account, see Instructions for getting additional computing accounts at IU.
- For instructions on requesting an SDA or RFS account for an IU group or department, see About requesting a departmental or group account.
After you submit your account request, UITS will notify you via email when your account is ready for use.
Once you have an SDA account, you can access it from any networked host that offers at least a TCP/IP-based FTP client.
Transferring files

Methods available for transferring data to and from the Indiana University Scholarly Data Archive (SDA) include Kerberos-enabled FTP, parallel FTP (pftp_client), Hierarchical Storage Interface (HSI), secure FTP (SFTP), secure copy (SCP), SMB/CIFS/Windows file sharing (SMB), and HTTPS (via a web browser). For instructions, see:
- At IU, how do I use parallel FTP to transfer data to or from the SDA?
- At IU, how do I use HSI to access my SDA account?
- At IU, how do I use SFTP or SCP to access my SDA account?
- At IU, how do I map or mount my SDA account to my workstation?
- At IU, how do I use the Scholarly Data Archive web interface?
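For scripted transfers, SFTP's batch mode avoids an interactive session. Below is a hedged sketch; the hostname, username, and paths are placeholders, so substitute the server name given in the SFTP/SCP instructions above:

```shell
# Write a batch file of SFTP commands (paths are illustrative)
cat > sda_batch.txt <<'EOF'
put results.tar.gz archive/results.tar.gz
ls archive
quit
EOF

# Run it non-interactively; "username" and the hostname below are
# placeholders, not the real SDA server name:
# sftp -b sda_batch.txt username@sda.example.iu.edu

# Show the batch file that would be executed
cat sda_batch.txt
```

Batch mode (`sftp -b`) aborts on the first failed command, which makes it suitable for unattended transfer jobs.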
The method you use depends on your operating system and your level of comfort with the command line.

The pftp_client interface, available on Quarry and Mason, is compatible only with Unix and Linux; along with hsi, it is among the highest-performing non-grid methods, and it requires installing special client software. It is more efficient than hsi, allowing parallel, high-bandwidth (up to 200 MB/s) data transfers between HPSS and jobs running on the IU research clusters. There is also a benefit to using pftp_client for sequential data transfers, because pftp_client arranges a connection directly with an HPSS mover, so data do not have to flow through the transaction engine node. UITS recommends this method to researchers who plan to make very large data transactions and need high bandwidth.

The hsi interface is compatible with Linux, Mac OS X, and Windows, and is somewhat more functional than the pftp_client interface. It provides shell-like facilities for recursive operations, such as the ability to take data from standard input and to force migration, staging, and purging.
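As a back-of-envelope check on the 200 MB/s figure, moving 1 TiB at that sustained rate takes roughly an hour and a half. A quick calculation using bash integer arithmetic (assuming "MB/s" here means MiB/s; real throughput varies with load and file sizes):

```shell
# Time to move 1 TiB at a sustained 200 MiB/s
# (bash arithmetic; ** is exponentiation)
size=$((1024**4))            # bytes in 1 TiB
rate=$((200 * 1024**2))      # 200 MiB/s in bytes per second
seconds=$((size / rate))     # integer division: ~5242 s
echo "$seconds seconds (~$((seconds / 60)) minutes)"
```

At that rate, multi-terabyte transfers are measured in hours, which is why UITS points researchers with very large transactions toward pftp_client rather than slower single-stream methods.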
For Windows or Mac OS X users who prefer a graphical interface, UITS recommends using a graphical SFTP client. For Mac OS X users, UITS recommends Fetch, especially if you intend to transfer large amounts of data.
Acknowledging grant support
The Indiana University cyberinfrastructure managed by the Research Technologies division of UITS is supported by funding from several grants, each of which requires you to acknowledge its support in all presentations and published works stemming from research it has helped to fund. Conscientious acknowledgment of support from past grants also enhances the chances of IU's research community securing funding from grants in the future. For the acknowledgment statement(s) required for scholarly printed works, web pages, talks, online publications, and other presentations that make use of this and/or other grant-funded systems at IU, see Grants to cite in published papers and presentations.