PdfExtractionController.cs
API controller that exposes PDF extraction functionality to the Umbraco backoffice.
What it does
Provides Management API endpoints for extracting content from PDFs:
- Accepts a media item’s unique ID (GUID)
- Retrieves the media item from Umbraco
- Gets the file path from the media properties
- Calls the PDF extraction service
- Returns the extracted text and page count
Endpoint

    GET /umbraco/management/api/v1/updoc/extract?mediaKey={guid}

Parameters
| Parameter | Type | Description |
|---|---|---|
| mediaKey | Guid | The unique identifier of the media item |
Response

    {
      "text": "Extracted text content...",
      "pageCount": 4
    }

Error responses
- 404 Not Found: Media item not found or file not on disk
- 400 Bad Request: Media has no file or extraction failed
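
The request and the documented status codes can be sketched from the client side as follows. This is a minimal illustration, not the backoffice's actual code: `ExtractResponse`, `buildExtractUrl`, and `describeExtractStatus` are hypothetical names, and the messages simply repeat the error descriptions listed above.

```typescript
// Shape of a successful response, taken from the example above.
interface ExtractResponse {
  text: string;
  pageCount: number;
}

// Hypothetical helper: builds the request URL for a media key.
function buildExtractUrl(mediaKey: string): string {
  return `/umbraco/management/api/v1/updoc/extract?mediaKey=${encodeURIComponent(mediaKey)}`;
}

// Hypothetical helper: maps the documented status codes to a message.
function describeExtractStatus(status: number): string {
  switch (status) {
    case 200:
      return "OK";
    case 400:
      return "Media has no file or extraction failed";
    case 404:
      return "Media item not found or file not on disk";
    default:
      return `Unexpected status ${status}`;
  }
}
```

A caller would typically check `response.ok` before reading the body as an `ExtractResponse`, and fall back to `describeExtractStatus(response.status)` otherwise.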
Section extraction endpoint

    GET /umbraco/management/api/v1/updoc/page-section?mediaKey={guid}&heading={text}

Parameters
| Parameter | Type | Description |
|---|---|---|
| mediaKey | Guid | The unique identifier of the media item |
| heading | string | The heading text to search for (case-insensitive) |
Response

    {
      "heading": "Suggested Itinerary",
      "content": "Day 1: Arrive in...\nDay 2: Visit..."
    }

Error responses
- 404 Not Found: Media item not found or file not on disk
- 400 Bad Request: Heading not found or extraction failed
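
Because the heading is free text, it needs URL encoding before it goes into the query string. The sketch below shows one way to build the request URL; `buildPageSectionUrl` is a hypothetical helper name, not part of the package.

```typescript
// Hypothetical helper: builds the page-section URL, encoding both query values
// so that spaces and characters such as "&" in a heading survive the round trip.
function buildPageSectionUrl(mediaKey: string, heading: string): string {
  const base = "/umbraco/management/api/v1/updoc/page-section";
  return `${base}?mediaKey=${encodeURIComponent(mediaKey)}&heading=${encodeURIComponent(heading)}`;
}
```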
Markdown extraction endpoint

    GET /umbraco/management/api/v1/updoc/extract-markdown?mediaKey={guid}

Extracts the full PDF content as Markdown with column detection for multi-column layouts.
Parameters
| Parameter | Type | Description |
|---|---|---|
| mediaKey | Guid | The unique identifier of the media item |
Response

    {
      "title": "The Castles and Gardens of Kent",
      "subtitle": "5 days from £889",
      "markdown": "## Day 1\n\nArrive at...\n\n## Day 2\n\nVisit...",
      "rawText": "Full raw text for debugging"
    }

Error responses
- 404 Not Found: Media item not found or file not on disk
- 400 Bad Request: Extraction failed
Map file endpoint

    GET /umbraco/management/api/v1/updoc/maps/{blueprintId}

Returns the map file configuration for a given blueprint.
Parameters
| Parameter | Type | Description |
|---|---|---|
| blueprintId | Guid | The blueprint unique identifier |
Response
Returns the full MapFile JSON object (name, documentTypeAlias, blueprintId, sourceTypes, propertyMappings).
Error responses
- 404 Not Found: No map file found for the given blueprint
Section extraction endpoint (config-driven)

    GET /umbraco/management/api/v1/updoc/extract-sections?mediaKey={guid}&blueprintId={guid}&sourceType=pdf

Extracts structured sections from a source document using the extraction rules defined in the config for the given blueprint. Routes to the correct extraction service based on sourceType.
Parameters
| Parameter | Type | Description |
|---|---|---|
| mediaKey | Guid | The unique identifier of the media item |
| blueprintId | Guid | The blueprint unique identifier (used to look up the config) |
| sourceType | string | The source type (pdf or markdown). Defaults to pdf. |
Response

    {
      "sections": {
        "title": "The Castles and Gardens of Kent",
        "description": "5 days from £889",
        "content": "## Day 1\n\nArrive at..."
      },
      "propertyMappings": [
        {
          "from": { "sectionType": "title" },
          "to": { "property": "pageTitle", "alsoMapTo": ["pageTitleShort"] }
        }
      ]
    }

The response includes both the extracted sections and the propertyMappings from the map file, so the frontend can apply each mapping to the document being created.
Error responses
- 404 Not Found: No map file found for the blueprint, or media item not found
- 400 Bad Request: Map file has no PDF extraction rules, or extraction failed
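
To illustrate how a frontend might apply the mappings from this response, here is a minimal sketch. The type names and `applyMappings` are hypothetical, and it assumes that every property in `alsoMapTo` receives the same value as the primary `property`; the actual frontend logic is not shown in this document and may differ.

```typescript
// Types inferred from the example response above (assumed, not authoritative).
interface PropertyMapping {
  from: { sectionType: string };
  to: { property: string; alsoMapTo?: string[] };
}

interface ExtractSectionsResponse {
  sections: Record<string, string>;
  propertyMappings: PropertyMapping[];
}

// Hypothetical helper: resolves each mapping to a { propertyAlias: value } map.
function applyMappings(response: ExtractSectionsResponse): Record<string, string> {
  const values: Record<string, string> = {};
  for (const mapping of response.propertyMappings) {
    const value = response.sections[mapping.from.sectionType];
    if (value === undefined) continue; // no section was extracted for this mapping
    values[mapping.to.property] = value;
    // Assumption: secondary targets get a copy of the same value.
    for (const alias of mapping.to.alsoMapTo ?? []) {
      values[alias] = value;
    }
  }
  return values;
}
```

With the example response, the result would contain `pageTitle` and `pageTitleShort`, both set to the extracted title.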
Private helper: ResolveMediaFilePath
The new endpoints use a shared ResolveMediaFilePath(Guid mediaKey) private method that encapsulates the media lookup and file path resolution logic (handling both JSON and simple path formats for umbracoFile). Returns null if the media item is not found or the file does not exist on disk.
Key concepts
Umbraco 17 Management API
Uses the Management API pattern introduced in Umbraco 14:
- [ApiVersion("1.0")]: API versioning
- [MapToApi("updoc")]: API grouping for Swagger
- [Authorize(Policy = AuthorizationPolicies.BackOfficeAccess)]: Backoffice auth
- [JsonOptionsName("UmbracoManagementApi")]: Umbraco JSON serialization
Media file path resolution
The umbracoFile property can be stored as:
- Simple path: /media/xxx/file.pdf
- JSON object: {"src":"/media/xxx/file.pdf"}
The controller handles both formats.
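
The two formats can be told apart with a small parsing step. The controller itself is written in C# and its implementation is not shown here; the TypeScript sketch below only illustrates the idea, and `parseUmbracoFile` is a hypothetical name.

```typescript
// Hypothetical helper: returns the media path from an umbracoFile value,
// or null when the value is empty or not in a recognised format.
function parseUmbracoFile(value: string): string | null {
  const trimmed = value.trim();
  if (trimmed === "") return null;

  // JSON object format: {"src":"/media/xxx/file.pdf"}
  if (trimmed.charAt(0) === "{") {
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (typeof parsed === "object" && parsed !== null && "src" in parsed) {
        const src = (parsed as { src: unknown }).src;
        return typeof src === "string" && src !== "" ? src : null;
      }
      return null;
    } catch {
      return null; // malformed JSON
    }
  }

  // Simple path format: /media/xxx/file.pdf
  return trimmed;
}
```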
Dependencies
- IMediaService: Umbraco media service
- IPdfExtractionService: PDF extraction service
- IPdfPagePropertiesService: PDF page properties service
- IMarkdownExtractionService: Markdown extraction service
- IWorkflowService: Map file service for blueprint-to-config lookups
- IWebHostEnvironment: For resolving file paths
- ILogger<PdfExtractionController>: Logging
Frontend usage

    const response = await fetch(
      `/umbraco/management/api/v1/updoc/extract?mediaKey=${mediaUnique}`,
      {
        method: 'GET',
        headers: {
          'Authorization': `Bearer ${token}`,
        },
      }
    );