Ocean Access (S3-compatible API)
Ocean provides an S3-compatible API for accessing your licensed datasets programmatically. This is the recommended approach for developers and for downloading large datasets efficiently.
Prerequisites
To access datasets via Ocean, you need:
- The organization and dataset slugs, which are displayed on the dataset access page
- An access key (created in Settings → Access Keys)
Creating an Access Key
- Log into app.humannative.ai
- Navigate to Settings → Access Keys
- Click “New access key” and follow the form
- Important: Copy the secret immediately; it is only shown once! You can export it to environment variables right away, as shown below.
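The obstore examples later on this page read credentials from OCEAN_ACCESS_KEY_ID and OCEAN_SECRET_ACCESS_KEY. These variable names are this guide’s convention rather than anything Ocean requires, so a minimal setup looks like:

```shell
# Paste the ID and secret from the access key form
export OCEAN_ACCESS_KEY_ID=<your-access-key-id>
export OCEAN_SECRET_ACCESS_KEY=<your-secret-access-key>
```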
AWS CLI Setup
Configure AWS Profile
Add the following to your ~/.aws/config file:
```ini
[profile ocean]
endpoint_url = https://ocean.humannative.ai
s3 =
  addressing_style = path
```

Add the following to your ~/.aws/credentials file:
```ini
[ocean]
# Your access key ID
aws_access_key_id = AKIA....
aws_secret_access_key = <your-secret-access-key>
```

Using the Profile
Set the AWS profile environment variable:
```shell
export AWS_PROFILE=ocean
```

Using AWS CLI
Once configured, you can use the AWS CLI as normal:
```shell
# List datasets for your organization
aws s3 ls <org-slug>

# Download a specific file
aws s3 cp s3://<org-slug>/<dataset-slug>/filename.ext ./

# Sync an entire dataset
aws s3 sync s3://<org-slug>/<dataset-slug>/ ./local-folder/
```
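The same profile also works from Python with boto3 (installed below for the obstore examples). A minimal sketch, assuming the ocean profile above and that Ocean supports the standard ListObjectsV2 call (which the CLI commands above rely on); the endpoint and path-style addressing are passed explicitly in case your botocore version does not read them from the profile:

```python
import boto3
from botocore.config import Config

# Reuse the "ocean" profile for credentials; pass the endpoint and
# path-style addressing explicitly so this works regardless of whether
# your botocore version honors those profile settings.
session = boto3.Session(profile_name="ocean")
s3 = session.client(
    "s3",
    endpoint_url="https://ocean.humannative.ai",
    config=Config(s3={"addressing_style": "path"}),
)

# List objects under a dataset prefix
resp = s3.list_objects_v2(Bucket="<org-slug>", Prefix="<dataset-slug>/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```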
Using obstore (Python)
obstore is a Python library that provides async access to object stores. Here are examples of how to use it with Ocean:
Installation
Section titled “Installation”pip install obstoreFor the advanced examples below, you’ll also need:
```shell
pip install obstore boto3
```

Basic Usage with Credentials Set in the Environment
This is the bare minimum of code needed to access a file with obstore.
Make sure to set OCEAN_ACCESS_KEY_ID and OCEAN_SECRET_ACCESS_KEY.
```python
import os

import obstore as obs
from obstore.store import S3Store

store = S3Store(
    "<org-slug>",  # The org and dataset slugs can be retrieved on the dataset page
    prefix="<dataset-slug>",
    endpoint="https://ocean.humannative.ai",
    access_key_id=os.getenv("OCEAN_ACCESS_KEY_ID"),
    secret_access_key=os.getenv("OCEAN_SECRET_ACCESS_KEY"),
    virtual_hosted_style_request=False,
)

# List files
files = obs.list(store).collect()
print(files)

# Download a file
resp = obs.get(store, "example.mp3")
print(resp.bytes())
```
Recommended Usage with AWS Profile
See the AWS CLI Setup section above for configuring your AWS credentials.
Then export AWS_PROFILE=ocean and run this code:
```python
import obstore as obs
from obstore.store import S3Store
from boto3 import Session
from obstore.auth.boto3 import Boto3CredentialProvider

# Use AWS profile for authentication (recommended)
boto_session = Session()
credential_provider = Boto3CredentialProvider(boto_session)

store = S3Store(
    "<org-slug>",  # The org and dataset slugs can be retrieved on the dataset page
    prefix="<dataset-slug>",
    endpoint="https://ocean.humannative.ai",
    virtual_hosted_style_request=False,
    credential_provider=credential_provider,
)

# List files
files = obs.list(store).collect()
print(files)

# Download a file
resp = obs.get(store, "example.mp3")
print(resp.bytes())
```
Advanced Example: Downloading Entire Datasets
For downloading complete datasets with concurrent downloads:
```python
import asyncio
from pathlib import Path
from typing import TYPE_CHECKING

from boto3 import Session
from obstore.auth.boto3 import Boto3CredentialProvider
from obstore.store import S3Store

if TYPE_CHECKING:
    from obstore.store import ClientConfig


async def download_file(store: S3Store, obj_meta: dict, local_file_path: Path, semaphore: asyncio.Semaphore):
    """Download a single file with concurrency control"""
    async with semaphore:
        try:
            print(f"Downloading {obj_meta['path']} to {local_file_path} ({obj_meta['size']} bytes)...")
            get_result = await store.get_async(obj_meta["path"])
            with open(local_file_path, "wb") as f:
                f.write(get_result.bytes())
            print(f"Successfully downloaded {local_file_path}")
        except Exception as e:
            print(f"Error downloading {obj_meta['path']}: {e}")


async def download_dataset(org_slug: str, dataset_slug: str, local_path: str = ".", concurrency: int = 2):
    """Download an entire dataset from Ocean"""

    # Setup authentication
    boto_session = Session()
    credential_provider = Boto3CredentialProvider(boto_session)

    # Allow very long downloads (see "Dealing with timeouts" below)
    client_config: ClientConfig = {}
    client_config["timeout"] = "2h"

    # Create store
    store = S3Store(
        bucket=org_slug,
        prefix=dataset_slug,
        credential_provider=credential_provider,
        client_options=client_config,
        endpoint="https://ocean.humannative.ai",
        virtual_hosted_style_request=False,
    )

    # Create local directory
    local_download_path = Path(local_path) / dataset_slug
    local_download_path.mkdir(parents=True, exist_ok=True)

    print(f"Listing objects for dataset {dataset_slug} from organization {org_slug}...")

    # List all objects
    objects = await store.list().collect_async()
    print(f"Found {len(objects)} objects")

    if not objects:
        print("No objects found")
        return

    # Create a semaphore to limit concurrent downloads
    semaphore = asyncio.Semaphore(concurrency)

    download_tasks = []
    for obj_meta in objects:
        # Skip directories (objects with size 0 and no extension)
        if obj_meta["size"] == 0 and "." not in obj_meta["path"]:
            print(f"Skipping directory: {obj_meta['path']}")
            continue

        local_file_path = local_download_path / obj_meta["path"]
        local_file_path.parent.mkdir(parents=True, exist_ok=True)

        task = download_file(store, obj_meta, local_file_path, semaphore)
        download_tasks.append(task)

    print(f"\nStarting parallel downloads to {local_download_path}...")
    print(f"Concurrency: {concurrency}")

    await asyncio.gather(*download_tasks)
    print("Download completed")


if __name__ == "__main__":
    # The org and dataset slug can be retrieved on the dataset page
    asyncio.run(download_dataset("<org-slug>", "<dataset-slug>", local_path="./downloads"))
```

Dealing with timeouts
Obstore has a client-side timeout, which defaults to 30 seconds.
This might not be enough when downloading large files.
To increase it, add a client_options map to the S3Store parameters:
```python
store = S3Store(
    # ...existing params
    client_options={
        "timeout": "10m",
    },
)
```

Timeout values are duration strings such as "30s", "10m", or "2h" (the advanced example above uses "2h").

Troubleshooting
Common Issues
Authentication Errors
- Verify your access key ID and secret are correct
- Ensure your AWS profile is properly configured
- Check that AWS_PROFILE=ocean is set if using profiles (a quick check follows this list)
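If authentication fails, the fastest diagnostic is the smallest possible request. This sketch reuses the Basic Usage pattern above and assumes the OCEAN_* environment variables are set; a failure here usually distinguishes a bad key or secret from a network or permission problem:

```python
import os

import obstore as obs
from obstore.store import S3Store

store = S3Store(
    "<org-slug>",
    prefix="<dataset-slug>",
    endpoint="https://ocean.humannative.ai",
    access_key_id=os.getenv("OCEAN_ACCESS_KEY_ID"),
    secret_access_key=os.getenv("OCEAN_SECRET_ACCESS_KEY"),
    virtual_hosted_style_request=False,
)

try:
    objects = obs.list(store).collect()  # fails fast if the key or secret is wrong
    print(f"Credentials OK, {len(objects)} objects visible")
except Exception as e:
    print(f"Request failed: {e}")
```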
Connection Issues
- Verify the endpoint URL: https://ocean.humannative.ai
- Ensure virtual_hosted_style_request=False is set
- Check your network connectivity (a reachability check follows this list)
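To separate network problems from configuration problems, first check whether the endpoint answers at all. Even an HTTP error response (for example a 403 to an unauthenticated request) proves the endpoint is reachable; only a connection failure points to a network issue:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Any HTTP response, even an error status, means the endpoint is reachable
try:
    urlopen("https://ocean.humannative.ai", timeout=10)
    print("Endpoint reachable")
except HTTPError as e:
    print(f"Endpoint reachable (HTTP {e.code})")
except URLError as e:
    print(f"Network problem: {e.reason}")
```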
Permission Errors
- Confirm you have access to the specified organization and dataset
- Verify your access key has the necessary permissions (a permission probe follows this list)
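To probe permissions on a specific organization and dataset, try a single-key listing via the ocean profile. This sketch assumes Ocean returns S3-style error codes such as AccessDenied or NoSuchBucket, which is an assumption to verify against the actual error you receive:

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

session = boto3.Session(profile_name="ocean")
s3 = session.client(
    "s3",
    endpoint_url="https://ocean.humannative.ai",
    config=Config(s3={"addressing_style": "path"}),
)

try:
    # A single-key listing is the cheapest way to exercise read permission
    s3.list_objects_v2(Bucket="<org-slug>", Prefix="<dataset-slug>/", MaxKeys=1)
    print("Read access confirmed")
except ClientError as e:
    print(f"Access problem: {e.response['Error']['Code']}")
```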