Ocean Access (S3-compatible API)
Ocean provides an S3-compatible API for accessing your licensed datasets programmatically. This is the recommended approach for developers and for downloading large datasets efficiently.
Prerequisites
To access datasets via Ocean, you need:
- The organization and dataset slugs, which are displayed on the dataset access page
- An access key (created in Settings → Access Keys)
Creating an Access Key
- Log into app.humannative.ai
- Navigate to Settings → Access Keys
- Click “New access key” and follow the form
- Important: Copy the secret immediately; it is only shown once! You can export it to environment variables right away, as shown below.
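The obstore examples later on this page read credentials from OCEAN_ACCESS_KEY_ID and OCEAN_SECRET_ACCESS_KEY. These variable names are this guide’s convention rather than anything Ocean requires, so a minimal setup looks like:

```shell
# Paste the ID and secret from the access key form
export OCEAN_ACCESS_KEY_ID=<your-access-key-id>
export OCEAN_SECRET_ACCESS_KEY=<your-secret-access-key>
```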
AWS CLI Setup
Configure AWS Profile
Add the following to your ~/.aws/config file:
```ini
[profile ocean]
endpoint_url = https://ocean.humannative.ai
s3 =
  addressing_style = path
```

Add the following to your ~/.aws/credentials file:
```ini
[ocean]
# Your access key ID
aws_access_key_id = AKIA....
aws_secret_access_key = <your-secret-access-key>
```

Using the Profile
Set the AWS profile environment variable:
```shell
export AWS_PROFILE=ocean
```

Using AWS CLI
Once configured, you can use the AWS CLI as normal:
```shell
# List datasets for your organization
aws s3 ls <org-slug>

# Download a specific file
aws s3 cp s3://<org-slug>/<dataset-slug>/filename.ext ./

# Sync an entire dataset
aws s3 sync s3://<org-slug>/<dataset-slug>/ ./local-folder/
```
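The same profile also works from Python with boto3 (installed below for the obstore examples). A minimal sketch, assuming the ocean profile above and that Ocean supports the standard ListObjectsV2 call (which the CLI commands above rely on); the endpoint and path-style addressing are passed explicitly in case your botocore version does not read them from the profile:

```python
import boto3
from botocore.config import Config

# Reuse the "ocean" profile for credentials; pass the endpoint and
# path-style addressing explicitly so this works regardless of whether
# your botocore version honors those profile settings.
session = boto3.Session(profile_name="ocean")
s3 = session.client(
    "s3",
    endpoint_url="https://ocean.humannative.ai",
    config=Config(s3={"addressing_style": "path"}),
)

# List objects under a dataset prefix
resp = s3.list_objects_v2(Bucket="<org-slug>", Prefix="<dataset-slug>/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```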
Using obstore (Python)
obstore is a Python library that provides async access to object stores. Here are examples of how to use it with Ocean:
Installation
Section titled “Installation”pip install obstoreFor the advanced examples below, you’ll also need:
```shell
pip install obstore boto3
```

Basic Usage with Credentials Set in the Environment
This is the bare minimum of code needed to access a file with obstore.
Make sure to set OCEAN_ACCESS_KEY_ID and OCEAN_SECRET_ACCESS_KEY.
```python
import os

import obstore as obs
from obstore.store import S3Store

store = S3Store(
    "<org-slug>",  # The org and dataset slugs can be retrieved on the dataset page
    prefix="<dataset-slug>",
    endpoint="https://ocean.humannative.ai",
    access_key_id=os.getenv("OCEAN_ACCESS_KEY_ID"),
    secret_access_key=os.getenv("OCEAN_SECRET_ACCESS_KEY"),
    virtual_hosted_style_request=False,
)

# List files
files = obs.list(store).collect()
print(files)

# Download a file
resp = obs.get(store, "example.mp3")
print(resp.bytes())
```
Recommended Usage with AWS Profile
See the AWS CLI Setup section above for configuring your AWS credentials.
Then export AWS_PROFILE=ocean and run this code:
```python
import obstore as obs
from obstore.store import S3Store
from boto3 import Session
from obstore.auth.boto3 import Boto3CredentialProvider

# Use AWS profile for authentication (recommended)
boto_session = Session()
credential_provider = Boto3CredentialProvider(boto_session)

store = S3Store(
    "<org-slug>",  # The org and dataset slugs can be retrieved on the dataset page
    prefix="<dataset-slug>",
    endpoint="https://ocean.humannative.ai",
    virtual_hosted_style_request=False,
    credential_provider=credential_provider,
)

# List files
files = obs.list(store).collect()
print(files)

# Download a file
resp = obs.get(store, "example.mp3")
print(resp.bytes())
```
Advanced Example: Downloading Entire Datasets
For downloading complete datasets with concurrent downloads:
```python
import asyncio
from pathlib import Path
from typing import TYPE_CHECKING

from boto3 import Session
from obstore.auth.boto3 import Boto3CredentialProvider
from obstore.store import S3Store

if TYPE_CHECKING:
    from obstore.store import ClientConfig


async def download_file(store: S3Store, obj_meta: dict, local_file_path: Path, semaphore: asyncio.Semaphore):
    """Download a single file with concurrency control"""
    async with semaphore:
        try:
            print(f"Downloading {obj_meta['path']} to {local_file_path} ({obj_meta['size']} bytes)...")
            get_result = await store.get_async(obj_meta["path"])
            with open(local_file_path, "wb") as f:
                f.write(get_result.bytes())
            print(f"Successfully downloaded {local_file_path}")
        except Exception as e:
            print(f"Error downloading {obj_meta['path']}: {e}")


async def download_dataset(org_slug: str, dataset_slug: str, local_path: str = ".", concurrency: int = 2):
    """Download an entire dataset from Ocean"""

    # Setup authentication
    boto_session = Session()
    credential_provider = Boto3CredentialProvider(boto_session)

    # Allow very long downloads (see "Dealing with timeouts" below)
    client_config: ClientConfig = {}
    client_config["timeout"] = "2h"

    # Create store
    store = S3Store(
        bucket=org_slug,
        prefix=dataset_slug,
        credential_provider=credential_provider,
        client_options=client_config,
        endpoint="https://ocean.humannative.ai",
        virtual_hosted_style_request=False,
    )

    # Create local directory
    local_download_path = Path(local_path) / dataset_slug
    local_download_path.mkdir(parents=True, exist_ok=True)

    print(f"Listing objects for dataset {dataset_slug} from organization {org_slug}...")

    # List all objects
    objects = await store.list().collect_async()
    print(f"Found {len(objects)} objects")

    if not objects:
        print("No objects found")
        return

    # Create a semaphore to limit concurrent downloads
    semaphore = asyncio.Semaphore(concurrency)

    download_tasks = []
    for obj_meta in objects:
        # Skip directories (objects with size 0 and no extension)
        if obj_meta["size"] == 0 and "." not in obj_meta["path"]:
            print(f"Skipping directory: {obj_meta['path']}")
            continue

        local_file_path = local_download_path / obj_meta["path"]
        local_file_path.parent.mkdir(parents=True, exist_ok=True)

        task = download_file(store, obj_meta, local_file_path, semaphore)
        download_tasks.append(task)

    print(f"\nStarting parallel downloads to {local_download_path}...")
    print(f"Concurrency: {concurrency}")

    await asyncio.gather(*download_tasks)
    print("Download completed")


if __name__ == "__main__":
    # The org and dataset slug can be retrieved on the dataset page
    asyncio.run(download_dataset("<org-slug>", "<dataset-slug>", local_path="./downloads"))
```

Dealing with timeouts
Obstore has a client-side timeout, which defaults to 30 seconds.
This might not be enough when downloading large files.
To increase it, add a client_options map to the S3Store parameters:
```python
store = S3Store(
    # ...existing params
    client_options={
        "timeout": "10m",
    },
)
```

Timeout values are duration strings such as "30s", "10m", or "2h" (the advanced example above uses "2h").

Troubleshooting
Common Issues
Authentication Errors
- Verify your access key ID and secret are correct
- Ensure your AWS profile is properly configured
- Check that AWS_PROFILE=ocean is set if using profiles (a quick check follows this list)
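If authentication fails, the fastest diagnostic is the smallest possible request. This sketch reuses the Basic Usage pattern above and assumes the OCEAN_* environment variables are set; a failure here usually distinguishes a bad key or secret from a network or permission problem:

```python
import os

import obstore as obs
from obstore.store import S3Store

store = S3Store(
    "<org-slug>",
    prefix="<dataset-slug>",
    endpoint="https://ocean.humannative.ai",
    access_key_id=os.getenv("OCEAN_ACCESS_KEY_ID"),
    secret_access_key=os.getenv("OCEAN_SECRET_ACCESS_KEY"),
    virtual_hosted_style_request=False,
)

try:
    objects = obs.list(store).collect()  # fails fast if the key or secret is wrong
    print(f"Credentials OK, {len(objects)} objects visible")
except Exception as e:
    print(f"Request failed: {e}")
```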
Connection Issues
- Verify the endpoint URL: https://ocean.humannative.ai
- Ensure virtual_hosted_style_request=False is set
- Check your network connectivity (a reachability check follows this list)
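To separate network problems from configuration problems, first check whether the endpoint answers at all. Even an HTTP error response (for example a 403 to an unauthenticated request) proves the endpoint is reachable; only a connection failure points to a network issue:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Any HTTP response, even an error status, means the endpoint is reachable
try:
    urlopen("https://ocean.humannative.ai", timeout=10)
    print("Endpoint reachable")
except HTTPError as e:
    print(f"Endpoint reachable (HTTP {e.code})")
except URLError as e:
    print(f"Network problem: {e.reason}")
```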
Permission Errors
- Confirm you have access to the specified organization and dataset
- Verify your access key has the necessary permissions (a permission probe follows this list)
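To probe permissions on a specific organization and dataset, try a single-key listing via the ocean profile. This sketch assumes Ocean returns S3-style error codes such as AccessDenied or NoSuchBucket, which is an assumption to verify against the actual error you receive:

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

session = boto3.Session(profile_name="ocean")
s3 = session.client(
    "s3",
    endpoint_url="https://ocean.humannative.ai",
    config=Config(s3={"addressing_style": "path"}),
)

try:
    # A single-key listing is the cheapest way to exercise read permission
    s3.list_objects_v2(Bucket="<org-slug>", Prefix="<dataset-slug>/", MaxKeys=1)
    print("Read access confirmed")
except ClientError as e:
    print(f"Access problem: {e.response['Error']['Code']}")
```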