Indiana University

The Scholarly Data Archive (SDA)

Note: Due to a possible data corruption issue with HTAR versions 4.0 and greater, you should update the HSI/HTAR client on your personal workstation as soon as possible to the patched version made available March 28, 2013, by the UITS Research Storage team. To download the patched version and get further information about this issue, see the Research Storage HSI page. If you have questions, email Research Storage.

System overview

The Indiana University Scholarly Data Archive (SDA) provides extensive capacity (42 PB) for storing and accessing research data. The system is located at both IU Bloomington and IUPUI, providing automatic off-site copies of data for disaster recovery.

The SDA is a distributed storage service offered to IU graduate students, faculty, and staff needing large-scale archival or near-line data storage, arranged in large files, for their research projects.

The SDA uses the consortium-developed High Performance Storage System (HPSS), a hierarchical storage management (HSM) software package that makes transparent to its users a hierarchy of storage media used to provide massive data storage capacity. This hierarchy includes disk caches totaling roughly 600 TB back-ending into two high-end tape libraries, providing a total uncompressed data storage capacity of nearly 15 PB. This near-line, tape-based storage system, mediated by fast, efficient disk caches, gives users the appearance of massive disk capacity at a fraction (usually a hundredth) of the cost of storing the same data on spinning disks.

Note: At IU, the initials SDA and HPSS are often used interchangeably to describe the same service.

Although the names of files placed on the SDA remain visible to the user, the actual data migrate to tape when they haven't been accessed for a certain period of time. Once data have migrated to tape, retrieving them can take up to two minutes per file, because the tape robot must locate, mount, and read the appropriate tape. Due to the overhead involved in manipulating data this way, the SDA is not well suited for storing a large number of small files.
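
The per-file overhead described above can be illustrated with a short Python sketch. It is not an official SDA tool: the function names and the idea of bundling files locally with Python's standard tarfile module are assumptions made for illustration, and the two-minute figure is the worst case quoted above, not a measured value.

```python
import tarfile
from pathlib import Path

# Worst-case recall time per file once data have migrated to tape
# (the "up to two minutes" figure quoted in the text above).
MAX_RECALL_MINUTES_PER_FILE = 2


def worst_case_recall_minutes(file_count: int) -> int:
    """Upper bound on recall time if every file needs its own tape recall."""
    return file_count * MAX_RECALL_MINUTES_PER_FILE


def bundle_directory(source_dir: str, archive_path: str) -> int:
    """Pack every regular file under source_dir into one uncompressed tar file.

    Returns the number of files added. The file list is collected before the
    archive is created, so a new archive written inside source_dir is not
    added to itself.
    """
    source = Path(source_dir)
    files = sorted(p for p in source.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w") as archive:
        for path in files:
            # Store paths relative to source_dir so the archive is relocatable.
            archive.add(path, arcname=path.relative_to(source).as_posix())
    return len(files)
```

Under these assumptions, 10,000 small files recalled one at a time could take up to 20,000 minutes (close to 14 days) in the worst case, whereas the same files bundled into a single archive would incur one recall.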

Once you have an SDA account, you can access it from any networked host that offers at least a TCP/IP-based FTP client. The SDA supports high-performance access methods, such as parallel FTP (PFTP) and Hierarchical Storage Interface (HSI), and an HPSS API is available for programmers, as well.

The SDA is the first HPSS system in the world to offer the hpssfs interface in production. The SDA is also the first HPSS system in the world to implement a remote data mover (at IUPUI). IU's remote data mover has demonstrated the feasibility of a widely distributed HPSS (i.e., one spanning a wide area network, or WAN) in which data stored and accessed by users at IUPUI are served locally by the IUPUI data mover at local area network (LAN) speeds. Only a small stream of metadata (administrative data about the files stored) flows on the WAN segment between IUB and IUPUI; this is necessary because the metadata engine is located at IUB. Such a geographically distributed storage system design is highly cost effective.

Since the I-Light high-performance network between IUB and IUPUI was established in 2001, the SDA HPSS system has been able to create two tape copies of user data (one at IUB and another at IUPUI), adding a degree of disaster tolerance to both sites.

System information


System configuration (aggregate information)

  • Machine type: Scholarly Data Archive (SDA)
  • Operating system: Red Hat Enterprise Linux 5

Storage information (aggregate information)

  • Network file system protocols: HSI/HTAR, CIFS (Samba), SFTP/SCP, HTTPS
  • Total tape capacity: 15 PB
  • Total disk capacity (cache): 600 TB
  • Availability scope: Access to the SDA is available to all IU graduate students, faculty, and staff. Undergraduates and non-IU collaborators must have IU faculty sponsors.
  • Quotas: 5 TB (default) per user, 5 TB (default) per project; increases as needed
  • Backup and purge policies: Dual copies of data, but no backups; system is never purged
  • Aggregate I/O: 80 Gbps
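
Before uploading a large dataset, it can help to compare its size with the default quota listed above. The following Python sketch is illustrative only: the function names are hypothetical, it measures a local directory rather than querying the SDA, and it assumes that "5 TB" means 5 × 10^12 bytes, which the page does not state.

```python
import os

# Default quota from the system information above. Decimal terabytes are an
# assumption; the page does not say whether TB means 10**12 or 2**40 bytes.
DEFAULT_QUOTA_BYTES = 5 * 10**12


def directory_size_bytes(root: str) -> int:
    """Total size in bytes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_quota(new_bytes: int, used_bytes: int = 0,
                  quota_bytes: int = DEFAULT_QUOTA_BYTES) -> bool:
    """True if adding new_bytes to used_bytes stays within quota_bytes."""
    return used_bytes + new_bytes <= quota_bytes
```

For example, `fits_in_quota(directory_size_bytes("results"), used_bytes=10**12)` checks a hypothetical local `results` directory against an account that already holds 1 TB. If the data do not fit, the quota can be increased as noted above.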

System access

Accounts on the Research File System (RFS) and Scholarly Data Archive (SDA) are available to graduate students, faculty, and staff at all Indiana University campuses. If you are an undergraduate at any IU campus and need access to RFS or SDA, you must be sponsored by a faculty member for specific research projects.

To apply for an account on the SDA or RFS if you have an individual or affiliate Network ID, see Instructions for getting more computing accounts at IU.

To apply for a group or departmental SDA or RFS account, see Requesting a departmental or group account.

You will receive an email message informing you when your account is ready for use.

Transferring files

Methods available for transferring data to and from the Indiana University Scholarly Data Archive (SDA) include Kerberos-enabled FTP, parallel FTP (pftp_client), Hierarchical Storage Interface (HSI), secure FTP (SFTP), secure copy (SCP), SMB/CIFS (Windows file sharing), and HTTPS (via a web browser).

The method you use depends on your operating system and your level of comfort with the command-line interface.

Access by pftp_client and hsi, the highest performing non-grid methods, requires installing special clients:

  • The pftp_client interface, available on Big Red, Quarry, and Mason, is compatible only with Unix and Linux. It is more efficient than hsi, allowing for parallel, high-bandwidth (up to 200 MB/s) data transfers between HPSS and jobs running on the IU research clusters. Sequential data transfers also benefit, because pftp_client arranges a connection directly with an HPSS mover, so data do not have to flow through the transaction engine node. UITS recommends this method to researchers who plan to make very large data transfers and need high bandwidth.

  • The hsi interface is compatible with Linux, Mac OS X, and Windows, and is somewhat more functional than the pftp_client interface. It provides shell-like facilities, including recursive operations, the ability to take data from standard input, and commands for forcing migration, staging, and purging.
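
To put the 200 MB/s figure quoted for pftp_client in perspective, the following Python sketch estimates a lower bound on transfer time at that rate. The function names are hypothetical, MB is assumed to mean 10^6 bytes, and real transfers are likely to be slower than this peak.

```python
# Peak pftp_client rate quoted above; MB is assumed to mean 10**6 bytes.
PFTP_PEAK_MB_PER_S = 200


def min_transfer_seconds(size_bytes: int,
                         rate_mb_per_s: float = PFTP_PEAK_MB_PER_S) -> float:
    """Lower bound on transfer time for size_bytes at the given rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate_mb_per_s must be positive")
    return size_bytes / (rate_mb_per_s * 10**6)


def format_duration(seconds: float) -> str:
    """Render a duration in seconds as hours, minutes, and seconds."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"
```

Under these assumptions, moving 1 TB (10^12 bytes) takes at least 5,000 seconds, which `format_duration` renders as `1h 23m 20s`.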

For Windows or Mac OS X users who prefer working with a graphical interface, UITS recommends using a graphical SFTP client. For Mac OS X users, UITS recommends Fetch, especially if you intend to transfer large amounts of data.

Reference

See On the Scholarly Data Archive at IU, what are classes of service, and how do I use them?

Policies and best use

The Research Storage systems (Research File System and Scholarly Data Archive) are deemed secure enough for storing Indiana University institutional data at all classification levels: critical, restricted, university-internal, and public. For more about classification levels for institutional data at IU, see Classifications of Institutional Data.

However, many types of data in the above classifications pertain to administrative functions of the university. You should not use the Research Storage systems as primary repositories of such data.

Research Storage systems are HIPAA-aligned, and using them for storing health-related data meets HIPAA storage requirements. However, you should encrypt electronic protected health information (ePHI) before storing it on the SDA.

Research Storage systems should not be used to store clinical data. They do not have sufficient reliability of data delivery to match clinical requirements.

Individuals using Research Storage systems for university work are responsible for ensuring that sensitive information is not stored in unapproved or inappropriate locations. If you are uncertain about the classification, privacy, and confidentiality aspects of your data, contact the Committee of Data Stewards for advice.

Support

The SDA is maintained by the Research Storage team. If you have questions or need help, email Research Storage.
