Textract

Protocol: JSON 1.1
Header: `X-Amz-Target: Textract.<Action>`
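To make the wire format concrete, the sketch below builds (but does not send) a raw request for one action using only the Python standard library. `application/x-amz-json-1.1` is the usual content type for AWS JSON 1.1 services, and the endpoint is the one used in the examples on this page. `build_textract_request` is a hypothetical helper name. This page does not say whether Floci checks a SigV4 `Authorization` header, so none is added here.

```python
import json
import urllib.request


def build_textract_request(action, payload, endpoint="http://localhost:4566"):
    """Build an unsigned JSON 1.1 POST request for a Textract action."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-amz-json-1.1",
            # The action name is appended to the fixed "Textract." prefix.
            "X-Amz-Target": f"Textract.{action}",
        },
    )


req = build_textract_request(
    "DetectDocumentText",
    {"Document": {"S3Object": {"Bucket": "my-bucket", "Name": "test.pdf"}}},
)
print(req.get_method(), req.full_url)
# urllib stores header names capitalized, hence the lookup key below.
print(req.get_header("X-amz-target"))
```

Passing `req` to `urllib.request.urlopen` would send it. In practice the AWS SDKs and CLI set these headers themselves.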

Floci emulates the AWS Textract API with a stub implementation. Response shapes match the real AWS Textract contracts, so AWS SDK and CLI clients accept the replies without error. No OCR or document analysis is performed: every call returns the same fixed set of Block objects with synthetic metadata.

Supported Operations

Operation	Notes
`DetectDocumentText`	Returns stub PAGE + LINE + WORD blocks
`AnalyzeDocument`	Returns stub blocks; `FeatureTypes` accepted but ignored
`StartDocumentTextDetection`	Returns a `JobId`; the job is marked `SUCCEEDED` immediately
`GetDocumentTextDetection`	Returns `SUCCEEDED` + stub blocks for a known `JobId`
`StartDocumentAnalysis`	Returns a `JobId`; the job is marked `SUCCEEDED` immediately
`GetDocumentAnalysis`	Returns `SUCCEEDED` + stub blocks for a known `JobId`

Document and DocumentLocation inputs (bytes or S3 references) are accepted but not parsed.

Block shape

Each response includes a 3-block hierarchy matching the AWS Block API shape:

BlockType	Text	Relationships
`PAGE`	(none)	CHILD → LINE
`LINE`	`"Floci"`	CHILD → WORD
`WORD`	`"Floci"`	(none)

Every block includes: Id (UUID), Confidence (99.9), Page (1), and a Geometry with BoundingBox + 4-point Polygon.
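The hierarchy can be walked by following each block's CHILD relationships. The sketch below does that over a hand-written sample that mirrors the table above; it assumes the standard Textract `Relationships` layout (a list of `{"Type": ..., "Ids": [...]}` entries), and `child_ids` and `render_tree` are hypothetical helper names. The sample Ids are placeholders, whereas real responses carry UUIDs.

```python
def child_ids(block):
    """Return the Ids listed under the block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel.get("Type") == "CHILD":
            ids.extend(rel.get("Ids", []))
    return ids


def render_tree(blocks):
    """Return one indented line per block, starting from each PAGE block."""
    by_id = {b["Id"]: b for b in blocks}
    lines = []

    def visit(block, depth):
        text = block.get("Text", "")
        lines.append(("  " * depth + block["BlockType"] + " " + text).rstrip())
        for cid in child_ids(block):
            if cid in by_id:
                visit(by_id[cid], depth + 1)

    for block in blocks:
        if block["BlockType"] == "PAGE":
            visit(block, 0)
    return lines


# Hand-written sample with placeholder Ids (assumption: not real Floci output).
sample_blocks = [
    {"Id": "page-1", "BlockType": "PAGE",
     "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}]},
    {"Id": "line-1", "BlockType": "LINE", "Text": "Floci",
     "Relationships": [{"Type": "CHILD", "Ids": ["word-1"]}]},
    {"Id": "word-1", "BlockType": "WORD", "Text": "Floci"},
]

for line in render_tree(sample_blocks):
    print(line)
```

The same function should work on `resp["Blocks"]` from any of the supported operations, since they share the block shape.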

Async job lifecycle

Start* operations store a job ID in memory and return it immediately. Get* calls with a valid job ID always return JobStatus: SUCCEEDED. Job IDs are not persisted across restarts. Passing a job ID issued by StartDocumentTextDetection to GetDocumentAnalysis (or vice versa) returns InvalidJobIdException.
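The rules above can be pictured with a toy in-memory model. This is an illustration of the described behavior only, not Floci's actual implementation: `JobStore`, the `kind` labels, and the local exception class are all hypothetical, and rejecting unknown IDs with the same error is an assumption.

```python
import uuid


class InvalidJobIdException(Exception):
    """Local stand-in for the error code named above."""


class JobStore:
    """Toy model: job IDs live in a dict and vanish when the process exits."""

    def __init__(self):
        self._jobs = {}  # job_id -> "TEXT_DETECTION" or "ANALYSIS"

    def start(self, kind):
        # Start* returns a fresh ID immediately; no work is queued.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = kind
        return job_id

    def get(self, job_id, kind):
        # An ID issued by the other operation family is rejected.
        # Assumption: unknown IDs are rejected the same way.
        if self._jobs.get(job_id) != kind:
            raise InvalidJobIdException(job_id)
        return {"JobStatus": "SUCCEEDED"}


store = JobStore()
text_job = store.start("TEXT_DETECTION")
print(store.get(text_job, "TEXT_DETECTION")["JobStatus"])

try:
    store.get(text_job, "ANALYSIS")
except InvalidJobIdException:
    print("rejected: job ID belongs to the other operation family")
```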

Configuration

Variable	Default	Description
`FLOCI_SERVICES_TEXTRACT_ENABLED`	`true`	Enable or disable the service

Examples

export AWS_ENDPOINT_URL=http://localhost:4566
export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test

# DetectDocumentText
aws textract detect-document-text \
  --document '{"S3Object":{"Bucket":"my-bucket","Name":"test.pdf"}}'

# AnalyzeDocument
aws textract analyze-document \
  --document '{"S3Object":{"Bucket":"my-bucket","Name":"test.pdf"}}' \
  --feature-types TABLES FORMS

# Async: start + poll
JOB_ID=$(aws textract start-document-text-detection \
  --document-location '{"S3Object":{"Bucket":"my-bucket","Name":"test.pdf"}}' \
  --query JobId --output text)

aws textract get-document-text-detection --job-id "$JOB_ID"

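# Python (boto3)
import boto3

# Region and dummy credentials mirror the shell exports above.
client = boto3.client(
    "textract",
    endpoint_url="http://localhost:4566",
    region_name="us-east-1",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)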

# Sync
resp = client.detect_document_text(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "test.pdf"}}
)
for block in resp["Blocks"]:
    print(block["BlockType"], block.get("Text", ""))

# Async
job = client.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "test.pdf"}}
)
result = client.get_document_text_detection(JobId=job["JobId"])
print(result["JobStatus"])  # SUCCEEDED
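
Code that also has to run against real AWS typically polls until the job leaves `IN_PROGRESS`. A minimal sketch of such a loop follows; `wait_for_job` is a hypothetical helper, and the demo uses a fake getter so the block runs without a server. Against Floci the loop should return on the first call, since jobs are reported as `SUCCEEDED` immediately.

```python
import time


def wait_for_job(get_result, job_id, delay=1.0, max_attempts=30):
    """Poll until the job leaves IN_PROGRESS, then return the final response."""
    for _ in range(max_attempts):
        result = get_result(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            return result
        time.sleep(delay)
    raise TimeoutError(f"job {job_id} still IN_PROGRESS after {max_attempts} polls")


def fake_get(JobId):
    """Stand-in for client.get_document_text_detection (demo only)."""
    return {"JobStatus": "SUCCEEDED", "Blocks": []}


print(wait_for_job(fake_get, "example-job-id")["JobStatus"])
```

With a boto3 client, the call would look like `wait_for_job(client.get_document_text_detection, job["JobId"])`.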

Out of Scope

Real OCR or document analysis (always returns a fixed stub block list).
AnalyzeExpense, AnalyzeID, StartLendingAnalysis, and other specialized analysis operations.
GetAdapterVersion, CreateAdapter, ListAdapters (Adapter management API).
GetDocumentTextDetection / GetDocumentAnalysis pagination via NextToken.
Persistent job storage across restarts.
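
Client code written for real AWS often follows `NextToken` when reading Get* results. Because pagination is out of scope here, a loop like the sketch below should finish after a single call against Floci, assuming the stub simply omits `NextToken` (this page does not confirm that). `collect_blocks` is a hypothetical helper, and the demo uses a fake getter so it runs without a server.

```python
def collect_blocks(get_result, job_id):
    """Gather Blocks from every page of a Get* response, following NextToken."""
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        page = get_result(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
        kwargs["NextToken"] = token


def single_page(JobId, NextToken=None):
    """Stand-in for a Get* call that returns one page and no NextToken."""
    return {"JobStatus": "SUCCEEDED", "Blocks": [{"Id": "page-1", "BlockType": "PAGE"}]}


print(len(collect_blocks(single_page, "example-job-id")))
```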