Skip to content

Athena

Protocol: JSON 1.1 Endpoint: http://localhost:4566/

Floci emulates Amazon Athena with real SQL execution powered by a floci-duck sidecar container running DuckDB. When a query is submitted, Floci spins up the sidecar on first use, injects CREATE OR REPLACE VIEW statements for each Glue-registered table pointing to S3 data, then executes the SQL and stores results as CSV in S3.

Supported Actions

Action Description
StartQueryExecution Submits a SQL query; executed asynchronously via DuckDB
GetQueryExecution Returns query status (QUEUED, RUNNING, SUCCEEDED, FAILED)
GetQueryResults Returns the result set for a completed query
ListQueryExecutions Returns a list of past query executions
StopQueryExecution Cancels a running query
CreateWorkGroup Creates a new workgroup
GetWorkGroup Returns information about a workgroup
ListWorkGroups Lists all workgroups

How it works

  1. Lazy sidecar start: On the first StartQueryExecution call, Floci checks for a local floci/floci-duck:latest image and starts the container. Subsequent queries reuse the running container.
  2. Glue DDL injection: Floci reads all Glue tables for the target database and generates CREATE OR REPLACE VIEW statements mapping each table name to its S3 location via DuckDB's read_parquet, read_json_auto, or read_csv_auto functions — chosen based on the table's InputFormat or SerDe serialization library.
  3. Query execution: The user's SQL is wrapped in COPY (...) TO 's3://...' (FORMAT CSV, HEADER) and executed. Results are written directly to the output S3 path.
  4. Results retrieval: GetQueryResults reads the CSV back from S3 and returns it in the standard Athena ResultSet shape.

Format inference

The DuckDB read function is chosen from the Glue table's StorageDescriptor:

Condition Read function
InputFormat or SerializationLibrary contains parquet read_parquet
InputFormat or SerializationLibrary contains json read_json_auto
InputFormat contains hive read_json_auto
Anything else read_csv_auto

Configuration

Property Default Description
FLOCI_SERVICES_ATHENA_MOCK false Set to true to disable DuckDB execution — queries immediately succeed with empty results
FLOCI_SERVICES_DUCK_DEFAULT_IMAGE floci/floci-duck:latest DuckDB sidecar image pulled on first use
FLOCI_SERVICES_DUCK_URL (unset) Point to an existing floci-duck instance and skip container management

Example — simple query

export AWS_ENDPOINT_URL=http://localhost:4566

# Start a query
QUERY_ID=$(aws athena start-query-execution \
  --query-string "SELECT 42 AS answer" \
  --query 'QueryExecutionId' \
  --output text)

# Wait for completion
aws athena get-query-execution --query-execution-id $QUERY_ID

# Get results
aws athena get-query-results --query-execution-id $QUERY_ID

Example — data lake query (S3 + Glue + Athena)

export AWS_ENDPOINT_URL=http://localhost:4566

# 1. Create S3 bucket and upload data
aws s3 mb s3://my-data-lake
echo '{"id":1,"amount":10.0}
{"id":2,"amount":20.0}
{"id":3,"amount":30.0}' | aws s3 cp - s3://my-data-lake/orders/data.json

# 2. Register table in Glue
aws glue create-database --database-input '{"Name":"analytics"}'

aws glue create-table \
  --database-name analytics \
  --table-input '{
    "Name": "orders",
    "StorageDescriptor": {
      "Location": "s3://my-data-lake/orders/",
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "SerdeInfo": {
        "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
      },
      "Columns": [
        {"Name": "id",     "Type": "int"},
        {"Name": "amount", "Type": "double"}
      ]
    }
  }'

# 3. Run Athena query
QUERY_ID=$(aws athena start-query-execution \
  --query-string "SELECT sum(amount) AS total FROM orders" \
  --query-execution-context Database=analytics \
  --query 'QueryExecutionId' \
  --output text)

# 4. Poll until done
while true; do
  STATE=$(aws athena get-query-execution \
    --query-execution-id $QUERY_ID \
    --query 'QueryExecution.Status.State' \
    --output text)
  [ "$STATE" = "SUCCEEDED" ] && break
  [ "$STATE" = "FAILED" ] && echo "Query failed" && exit 1
  sleep 1
done

# 5. Fetch results
aws athena get-query-results --query-execution-id $QUERY_ID

Shared sidecar with S3 Select

The floci-duck sidecar is shared between Athena and S3 Select. Once started by the first Athena query, it is also used by SelectObjectContent for CSV (with FileHeaderInfo=USE), JSON, and Parquet inputs. If Athena has not yet executed a query, S3 Select falls back to the built-in Java evaluator for CSV and JSON — Parquet always requires the sidecar.

See S3 Select for details on execution modes and supported SQL operators.

Mock mode

Set FLOCI_SERVICES_ATHENA_MOCK=true to skip DuckDB entirely for Athena. In this mode queries transition to SUCCEEDED immediately with an empty result set — useful for unit tests that only exercise the Athena state machine, not the query results.

When mock mode is enabled the sidecar does not start. S3 Select will use the Java evaluator for CSV and JSON. Parquet queries will fail unless FLOCI_SERVICES_DUCK_URL points to an already-running floci-duck instance.