
Large Data Support

Learn about ESS-DIVE's large data support tools and how they're used to upload and publish large data.
ESS-DIVE now has a Tier 2 data storage service to support publishing very large, hierarchical datasets that can be directly accessed from our repository. ESS-DIVE uses Globus, a data transfer service, to make it easier to upload large data to ESS-DIVE. The Tier 2 and Globus services are set up offline with close assistance from the ESS-DIVE Team.

How to contact ESS-DIVE about publishing large data:

For large data support, please email us at [email protected] with the following information about your data:
  1. What's the total file volume of your dataset?
  2. Approximately how many files are in your dataset and what's the range of file sizes?
  3. Is the data structure hierarchical? If yes, can you easily flatten your data structure?
  4. Where is your data stored currently (e.g. local desktop, cloud based server, Google Drive)?
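
If your data is on a local file system, the first three answers can be gathered with a short script. The following is a minimal sketch, not an ESS-DIVE tool: the function name `summarize_dataset` and the returned keys are hypothetical, and it assumes every file under the folder is readable and belongs to the dataset.

```python
import os


def summarize_dataset(root):
    """Walk `root` and collect total volume, file count, size range, and nesting depth."""
    sizes = []
    max_depth = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Depth 0 means the file sits directly in `root`.
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if filenames:
            max_depth = max(max_depth, depth)
        for name in filenames:
            sizes.append(os.path.getsize(os.path.join(dirpath, name)))
    return {
        "total_bytes": sum(sizes),
        "file_count": len(sizes),
        "min_bytes": min(sizes) if sizes else 0,
        "max_bytes": max(sizes) if sizes else 0,
        "max_depth": max_depth,  # 0 indicates a flat (non-hierarchical) layout
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py /path/to/dataset
    if len(sys.argv) > 1:
        print(summarize_dataset(sys.argv[1]))
```

A `max_depth` of 0 suggests the dataset is already flat; larger values indicate a hierarchical structure worth mentioning in your email.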

Globus: Upload Large Data

Globus (https://www.globus.org/) is a free, cloud-based data transfer service designed to move significant amounts of data. ESS-DIVE uses this service to move data from your local desktop or existing Globus endpoint to ESS-DIVE's storage services. This large data support tool can be used to resolve common upload errors or as the default upload method for data greater than 500GB.
Learn more about Globus, including how to use it to publish data on ESS-DIVE or resolve upload issues, on our Globus documentation page.
Figure 1: The Globus file manager (pictured) is accessible via browser and is used as the primary interface for transferring data with Globus.

Tier 2: Storage for Large Data

Tier 2 (Figure 2) is ESS-DIVE's extended storage resource for very large, hierarchical datasets. It is used instead of storing the data directly on ESS-DIVE's dataset landing pages, or Tier 1 (Figure 3). Data greater than 500GB in volume will be archived on Tier 2 by default. Additionally, Tier 2 lets you browse hierarchical folders in your browser prior to download.
Data stored on Tier 2 resources can be accessed and downloaded from the Tier 2 landing page (Figure 2). This is separate from ESS-DIVE's dataset landing page (Figure 3). You can choose to publish some or all of your dataset files on Tier 2.
Generally, data should be stored on Tier 1 whenever possible. ESS-DIVE is constantly expanding and improving features on Tier 1 that may not be supported on Tier 2. However, data less than 500GB can be published on Tier 2 if necessary.
Any data contributor can take advantage of the Tier 2 service even if your data is less than 500GB. Please contact ESS-DIVE at [email protected] to discuss if your data is suitable for Tier 2 storage.
Figure 2: Tier 2 landing page for large file exploration and download. Access to dataset metadata on Tier 1 is provided via link.
Figure 3: Tier 1 dataset landing page where metadata and data can be discovered and downloaded. Access to files on Tier 2 is provided via external link.
Data contributors must use the Globus transfer service to upload their data to Tier 2. Once the data is uploaded via Globus, ESS-DIVE will organize it and add file metadata on the Tier 2 landing page. The data contributor will review and approve the data on Tier 2 prior to publication. At the time of publication, the data will be publicly accessible on the Globus "ESS-DIVE Public Share" collection as well as on the Tier 2 website. Additionally, external links to both Tier 2 and Globus will be added to the dataset metadata landing page for access and download (demonstrated in Figure 3).

Management and Preservation of Tier 2 Data

ESS-DIVE stores redundant copies of data published on Tier 2 resources to preserve and provide long-term access to Environmental Systems Science (ESS) research data.
Please be aware that, at this time, the following limitations apply to Tier 2 data:
  1. Tier 2 data will not be linked to the DataONE federation.
  2. Tier 2 data cannot be private.
  3. Downloads and views of Tier 2 data will not be factored into data package statistics.

How to Download Data From Tier 2

Tier 2 data files are accessible for download via the external link table listed at the top of the dataset metadata landing page (Figure 3). In this table, there are two options to choose from when downloading Tier 2 data: either HTTP or Globus.

Download data from Tier 2 (HTTP):

The HTTP access link can be used to download individual files directly from your browser or, for those familiar with command line tools, it can be used to download data in bulk.
  1. Locate and open the dataset landing page on ESS-DIVE: https://data.ess-dive.lbl.gov/.
  2. Scroll down to the "External Links to Data and Metadata" table and select the URL next to the external link titled "ESS-DIVE Tier 2 (HTTP)"; this will redirect you to the Tier 2 data page (Figure 2).
  3. All data files will be located in the data/ folder.
  4. You can browse and download files individually as needed by selecting the file.
  5. If you are familiar with command line tools, you can initiate a bulk download with wget or curl by pulling all the files in the data folder. The manifest-md5.txt contains a list of all files with MD5 sums.
    • Example code coming soon
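
Until an official example is available, the following minimal sketch shows one way to check a finished bulk download against manifest-md5.txt. It is not ESS-DIVE's own tooling. It assumes each manifest line has the common `<md5> <relative path>` layout (as written by md5sum and used in BagIt manifests), which this page does not confirm, and that the paths are relative to the folder you downloaded into. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_manifest(text):
    """Return {relative_path: md5_hex} from manifest text (assumed '<md5> <path>' per line)."""
    entries = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        checksum, path = parts
        # md5sum marks binary-mode entries with a leading '*'
        entries[path.lstrip("*")] = checksum.lower()
    return entries


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(manifest_path, download_root):
    """Return (path, problem) pairs for files that are missing or fail the checksum."""
    entries = parse_manifest(Path(manifest_path).read_text())
    problems = []
    for rel_path, expected in entries.items():
        local = Path(download_root) / rel_path
        if not local.is_file():
            problems.append((rel_path, "missing"))
        elif md5_of(local) != expected:
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

An empty list from `verify_download` means every file listed in the manifest was found and matched its MD5 sum; any reported file can be downloaded again individually.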

Download data from Tier 2 (Globus):

The Globus access link can be used to download multiple files at once from your browser. Additionally, you can choose to download the files locally or to transfer them to an existing cloud storage service, if applicable. To download files locally, it is necessary to download Globus Connect Personal.
  1. Locate and open the dataset landing page on ESS-DIVE: https://data.ess-dive.lbl.gov/.
  2. Scroll down to the "External Links to Data and Metadata" table and select the URL next to the external link titled "ESS-DIVE Tier 2 (Globus)"; this will redirect you to the public Globus collection file manager (Figure 1).
  3. You will be prompted to log in before you can access the data on Globus. If you are new to Globus, see ESS-DIVE's instructions for logging in to Globus before proceeding.
  4. Select the directory or files that you would like to download.
  5. Decide where you want to download the data (locally or existing cloud service).
    • Local download: If you are new to Globus, see ESS-DIVE's instructions to set up Globus Connect Personal and set up a collection for your local machine.
    • Existing cloud service: If your existing cloud service is not already accessible via Globus, it will be necessary to initialize an external collection. See Globus' documentation to learn how to initialize a collection.
  6. In the empty panel in your file manager, search for and select the collection where you would like to download the data.
    • See ESS-DIVE's instructions for using the Globus file manager to transfer or download data.
  7. Drag and drop files from the data collection into the remote endpoint or select the files and hit the transfer button.