PdfPagePropertiesService.cs

Service that extracts page properties (title and description) from PDF files using font size and position analysis.

What it does

Analyzes the first page of a PDF to extract structured page properties:

Opens PDF files from file path or stream
Extracts text with position and font size data
Groups words into lines by Y-coordinate
Identifies title from largest font text at top of page
Identifies description from next text block below title

Interface

public interface IPdfPagePropertiesService
{
    PdfPageProperties ExtractFromFile(string filePath);
    PdfPageProperties ExtractFromStream(Stream stream);
    PdfSectionResult ExtractSectionByHeading(string filePath, string headingText);
    PdfMarkdownResult ExtractAsMarkdown(string filePath);
    ExtractionResult ExtractSections(string filePath, PdfExtractionRules rules);
}

Result classes

public class PdfPageProperties
{
    public string Title { get; set; }        // Document title from largest font
    public string Description { get; set; }  // Description from text below title
    public string? Error { get; set; }       // Error message if extraction failed
}

public class PdfSectionResult
{
    public string Heading { get; set; }      // The heading that was found
    public string Content { get; set; }      // The content text extracted under the heading
    public string? Error { get; set; }       // Error message if extraction failed
}

public class PdfMarkdownResult
{
    public string Title { get; set; }        // Document title (H1)
    public string Subtitle { get; set; }     // Document subtitle/description
    public string Markdown { get; set; }     // Full content as Markdown
    public string RawText { get; set; }      // Raw text for debugging
    public string? Error { get; set; }       // Error message if extraction failed
}

public class ExtractionResult
{
    public Dictionary<string, string> Sections { get; set; } = new();  // Keyed by type: title, description, content
    public string RawText { get; set; } = string.Empty;                // Raw text with font sizes for debugging
    public string? Error { get; set; }                                 // Error message if extraction failed
}

Property mapping

The extracted properties map to Umbraco page fields:

Extracted Property	Umbraco Field(s)
Title	Page Title, Page Title Short, Node Name
Description	Page Description

Key concepts

Title extraction

Groups words into lines by Y-coordinate proximity
Calculates average font size per line
Identifies largest font size in document
Collects consecutive lines from top with similar large font
Combines into single title string
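
The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the service's actual code: the tuple shape, the helper names GroupIntoLines and PickTitle, the 2-unit Y tolerance and the 10% size tolerance are all assumptions made for the example. Height stands in for font size, as described under "Font size estimation".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: each word has a left X, a baseline Y and a box height.
// PDF coordinates grow upwards, so a larger Y is nearer the top of the page.
var sampleWords = new List<(string Text, double X, double Y, double Height)>
{
    ("The", 50, 700, 24), ("Castles", 100, 700.5, 24),
    ("of", 50, 670, 23.5), ("Kent", 80, 670, 24),
    ("5", 50, 640, 12), ("days", 60, 640, 12),
};

var pageLines = GroupIntoLines(sampleWords, yTolerance: 2.0);
var title = PickTitle(pageLines, sizeTolerance: 0.10);
Console.WriteLine(title);

// Groups words whose Y is within yTolerance of the first word in the current line,
// then orders lines top to bottom and words left to right.
static List<(string Text, double FontSize)> GroupIntoLines(
    IEnumerable<(string Text, double X, double Y, double Height)> words, double yTolerance)
{
    var buckets = new List<List<(string Text, double X, double Y, double Height)>>();
    foreach (var word in words.OrderByDescending(w => w.Y))
    {
        var bucket = buckets.LastOrDefault();
        if (bucket == null || Math.Abs(bucket[0].Y - word.Y) > yTolerance)
        {
            bucket = new List<(string Text, double X, double Y, double Height)>();
            buckets.Add(bucket);
        }
        bucket.Add(word);
    }
    return buckets
        .Select(b => (string.Join(" ", b.OrderBy(w => w.X).Select(w => w.Text)),
                      b.Average(w => w.Height)))
        .ToList();
}

// Skips any smaller lines above the title, then takes consecutive lines whose
// average size is within sizeTolerance of the largest size on the page.
static string PickTitle(List<(string Text, double FontSize)> lines, double sizeTolerance)
{
    if (lines.Count == 0) return string.Empty;
    var largest = lines.Max(l => l.FontSize);
    var titleLines = lines
        .SkipWhile(l => l.FontSize < largest * (1 - sizeTolerance))
        .TakeWhile(l => l.FontSize >= largest * (1 - sizeTolerance))
        .Select(l => l.Text);
    return string.Join(" ", titleLines);
}
```

With the sample data, the two 24-point lines are joined into "The Castles of Kent" and the 12-point line is left for the description step.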

Description extraction

Finds first non-title line with content
Collects lines with similar font size (15% tolerance)
Limits to 3 lines maximum
Combines into description string
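
A rough sketch of these rules, assuming the lines have already been grouped and measured. The helper name PickDescription, its parameters and the sample lines are hypothetical; only the 15% tolerance and the 3-line limit come from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample lines, top to bottom, with their average font size.
var sampleLines = new List<(string Text, double FontSize)>
{
    ("The Castles and Gardens of Kent", 24),
    ("", 12),
    ("5 days from £889", 14),
    ("Departing April to October", 13),
    ("Day 1", 18),
};

var description = PickDescription(sampleLines, titleLineCount: 1);
Console.WriteLine(description);

static string PickDescription(
    List<(string Text, double FontSize)> lines, int titleLineCount,
    double tolerance = 0.15, int maxLines = 3)
{
    // Start at the first non-empty line after the title.
    var candidates = lines.Skip(titleLineCount)
        .SkipWhile(l => string.IsNullOrWhiteSpace(l.Text))
        .ToList();
    if (candidates.Count == 0) return string.Empty;

    // Keep following lines while their size stays within the tolerance
    // of the first description line, up to maxLines.
    var baseSize = candidates[0].FontSize;
    var picked = candidates
        .TakeWhile(l => !string.IsNullOrWhiteSpace(l.Text)
                     && Math.Abs(l.FontSize - baseSize) <= baseSize * tolerance)
        .Take(maxLines)
        .Select(l => l.Text);
    return string.Join(" ", picked);
}
```

Here the 13-point line is within 15% of the 14-point line and is kept, while the 18-point "Day 1" heading ends the description.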

Font size estimation

Word bounding-box height is used as a proxy for font size, because PdfPig’s GetWords() returns a reliable bounding box for every word.

Section extraction by heading

The ExtractSectionByHeading method extracts content from a specific section:

Searches all pages for a line containing the heading text (case-insensitive)
Captures all text lines below the heading
Stops when it encounters another heading (similar or larger font size)
Returns the heading found and the content as plain text

Special handling for itinerary extraction: When the heading matches a “Day N” pattern (e.g., “Day 1”), the extraction continues through all subsequent “Day N” headings (Day 2, Day 3, etc.) until a non-Day heading is encountered. This allows extracting complete itineraries that span multiple days.

// Example: Extract full itinerary by searching for "Day 1"
var result = _pagePropertiesService.ExtractSectionByHeading(path, "Day 1");
// result.Content will contain Day 1, Day 2, Day 3... until end of itinerary
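
The stop rule and the "Day N" exception can be illustrated with a small sketch. It is not the service's implementation: the ExtractSection helper, the 0.95 factor used for "similar" font size and the sample lines are assumptions for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample lines, top to bottom, with their font size.
var sampleLines = new List<(string Text, double FontSize)>
{
    ("Overview", 18), ("A short break in Kent.", 11),
    ("Day 1", 18), ("Arrive at the hotel.", 11),
    ("Day 2", 18), ("Visit Leeds Castle.", 11),
    ("What's included", 18), ("Breakfast daily.", 11),
};

var content = ExtractSection(sampleLines, "day 1");
Console.WriteLine(content);

static string ExtractSection(List<(string Text, double FontSize)> lines, string headingText)
{
    var dayPattern = new Regex(@"^\s*Day\s+\d+\b", RegexOptions.IgnoreCase);

    // Case-insensitive search for the first line containing the heading text.
    var start = lines.FindIndex(
        l => l.Text.Contains(headingText, StringComparison.OrdinalIgnoreCase));
    if (start < 0) return string.Empty;

    var headingSize = lines[start].FontSize;
    var isItinerary = dayPattern.IsMatch(lines[start].Text);
    var collected = new List<string>();

    foreach (var line in lines.Skip(start + 1))
    {
        // A line of similar or larger size counts as the next heading.
        var isHeading = line.FontSize >= headingSize * 0.95;
        // In itinerary mode, further "Day N" headings are kept instead of stopping.
        if (isHeading && !(isItinerary && dayPattern.IsMatch(line.Text))) break;
        collected.Add(line.Text);
    }
    return string.Join("\n", collected);
}
```

With the sample lines, the section starts at "Day 1", runs through "Day 2" and stops at "What's included".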

Markdown extraction with column detection

The ExtractAsMarkdown method extracts the full PDF content as Markdown. It detects column boundaries so that the multi-column layouts common in travel brochures and itineraries are read in the correct order:

Analyzes word positions to detect column boundaries
Merges multi-column content into a single flow
Identifies headings by font size and applies Markdown heading levels
Preserves paragraph structure with proper spacing

// Example: Extract full PDF as Markdown
var result = _pagePropertiesService.ExtractAsMarkdown(path);
// result.Title = "The Castles and Gardens of Kent"
// result.Subtitle = "5 days from £889"
// result.Markdown = "## Day 1\n\nArrive at...\n\n## Day 2\n\nVisit..."
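
One common way to detect a column boundary is to look for the widest empty horizontal band between words. The sketch below shows that idea only. Whether the service uses this exact approach is not stated here, and the ReadInColumnOrder helper, the minGap threshold and the sample words are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample: two columns, each with a heading and one word of body text.
var sampleWords = new List<(string Text, double Left, double Right, double Y)>
{
    ("Day", 40, 70, 700), ("1", 75, 82, 700), ("Day", 320, 350, 700), ("2", 355, 362, 700),
    ("Arrive", 40, 90, 680), ("Visit", 320, 360, 680),
};

var ordered = ReadInColumnOrder(sampleWords, minGap: 50);
Console.WriteLine(string.Join(" ", ordered));

static List<string> ReadInColumnOrder(
    List<(string Text, double Left, double Right, double Y)> words, double minGap)
{
    // Sweep words left to right, tracking the furthest right edge seen so far.
    // The widest stretch with no word in it is the candidate column gutter.
    double bestGap = 0, boundary = double.NaN, maxRight = double.NegativeInfinity;
    foreach (var word in words.OrderBy(w => w.Left))
    {
        if (!double.IsNegativeInfinity(maxRight) && word.Left - maxRight > bestGap)
        {
            bestGap = word.Left - maxRight;
            boundary = (word.Left + maxRight) / 2;
        }
        maxRight = Math.Max(maxRight, word.Right);
    }

    // Split at the gutter only if it is wide enough; otherwise treat as one column.
    var columns = bestGap >= minGap
        ? new[] { words.Where(w => w.Left < boundary), words.Where(w => w.Left >= boundary) }
        : new[] { words.AsEnumerable() };

    // Read each column top to bottom, left to right, then move to the next column.
    return columns
        .SelectMany(col => col.OrderByDescending(w => w.Y).ThenBy(w => w.Left))
        .Select(w => w.Text)
        .ToList();
}
```

Reading the sample row by row would interleave the two days. Splitting at the gutter first keeps each day's text together.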

Map-driven section extraction (ExtractSections)

The ExtractSections method extracts structured sections from a PDF using rules defined in a PdfExtractionRules object (from a map file). Instead of hardcoded patterns, it uses configurable rules for title detection, description matching, content start/stop boundaries, and heading levels.

var result = _pagePropertiesService.ExtractSections(path, mapFile.SourceTypes.Pdf.Extraction);
// result.Sections["title"] = "The Castles and Gardens of Kent"
// result.Sections["description"] = "5 days from £889"
// result.Sections["content"] = "## Day 1\n\nArrive at..."

The extraction process:

Extracts text lines from all pages (with optional column detection filtering)
Identifies the title using the TitleDetection.FontSizeThreshold ratio against the largest font
Matches the description using the DescriptionPattern regex
Captures content between the Content.StartPattern regex and any Content.StopPatterns
Formats content headings using the configured Content.HeadingLevel (h1-h4)
Returns a Dictionary<string, string> with keys: title, description, content

The private ExtractSectionsFromDocument method implements this logic, using Regex objects built from the map file rules instead of the hardcoded patterns used by ExtractAsMarkdown.
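
The rule-driven flow can be sketched as below. The real PdfExtractionRules type is not reproduced here, so the rules are passed as plain parameters. The local ExtractSections helper, the headingMinFontSize parameter (a stand-in for however the service recognizes headings inside the content) and the sample lines and patterns are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample lines, top to bottom, with their font size.
var sampleLines = new List<(string Text, double FontSize)>
{
    ("The Castles and Gardens of Kent", 24),
    ("5 days from £889", 14),
    ("Day 1", 16), ("Arrive at the hotel.", 11),
    ("Booking conditions", 16), ("Terms apply.", 9),
};

var sections = ExtractSections(
    sampleLines,
    titleFontSizeThreshold: 0.9,
    descriptionPattern: @"^\d+ days from",
    startPattern: @"^Day 1\b",
    stopPatterns: new[] { @"^Booking conditions" },
    headingLevel: 2,
    headingMinFontSize: 15);
Console.WriteLine(sections["content"]);

static Dictionary<string, string> ExtractSections(
    List<(string Text, double FontSize)> lines, double titleFontSizeThreshold,
    string descriptionPattern, string startPattern, string[] stopPatterns,
    int headingLevel, double headingMinFontSize)
{
    var result = new Dictionary<string, string>();
    if (lines.Count == 0) return result;

    // Title: lines whose size is at least threshold x the largest size.
    var largest = lines.Max(l => l.FontSize);
    result["title"] = string.Join(" ", lines
        .Where(l => l.FontSize >= largest * titleFontSizeThreshold)
        .Select(l => l.Text));

    // Description: first line matching the configured regex.
    var description = new Regex(descriptionPattern, RegexOptions.IgnoreCase);
    result["description"] =
        lines.Select(l => l.Text).FirstOrDefault(t => description.IsMatch(t)) ?? string.Empty;

    // Content: from the start match up to, but not including, the first stop match.
    var start = new Regex(startPattern, RegexOptions.IgnoreCase);
    var stops = stopPatterns.Select(p => new Regex(p, RegexOptions.IgnoreCase)).ToList();
    var prefix = new string('#', headingLevel) + " ";
    var content = lines
        .SkipWhile(l => !start.IsMatch(l.Text))
        .TakeWhile(l => !stops.Any(s => s.IsMatch(l.Text)))
        .Select(l => l.FontSize >= headingMinFontSize ? prefix + l.Text : l.Text);
    result["content"] = string.Join("\n\n", content);
    return result;
}
```

Because every boundary is a pattern supplied by the caller, a different brochure template needs only a different set of rules, not a code change.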

Usage

using Microsoft.AspNetCore.Mvc;

public class MyController : ControllerBase
{
    private readonly IPdfPagePropertiesService _pagePropertiesService;

    public MyController(IPdfPagePropertiesService pagePropertiesService)
    {
        _pagePropertiesService = pagePropertiesService;
    }

    public IActionResult GetPageProperties(string path)
    {
        var result = _pagePropertiesService.ExtractFromFile(path);

        if (!string.IsNullOrEmpty(result.Error))
            return BadRequest(result.Error);

        return Ok(new
        {
            title = result.Title,
            description = result.Description
        });
    }
}

Rule condition matching (MatchesCondition)

The static MatchesCondition method evaluates a single rule condition against a PDF element. ContentTransformService calls it during the Shape layer to determine which rule matches each element.

public static bool MatchesCondition(PdfElement element, RuleCondition condition, int index, int total)

Supported condition types:

Type	Matching Logic
`fontSizeEquals`	`Math.Abs(element.FontSize - value) <= 0.5`
`fontSizeRange`	`element.FontSize >= min && element.FontSize <= max` (parses `{ min, max }` JSON object)
`fontSizeAbove`	`element.FontSize > value`
`fontSizeBelow`	`element.FontSize < value`
`fontNameContains`	Case-insensitive substring match on font name
`colorEquals`	Case-insensitive hex color comparison
`isBoldEquals`	Compares `element.IsBold` to parsed boolean value
`htmlTagEquals`	Case-insensitive match on `element.HtmlTag`
`cssClassContains`	Case-insensitive substring match on `element.CssClasses`
`htmlContainerPathContains`	Case-insensitive substring match on `element.ContainerPath`
`positionFirst`	`index == 0`
`positionLast`	`index == total - 1`
`textBeginsWith`	Case-insensitive `StartsWith` on element text
`textEndsWith`	Case-insensitive `EndsWith` on element text
`textContains`	Case-insensitive `Contains` on element text
`textMatchesPattern`	Regex match on element text

The fontSizeRange condition takes a JSON object such as { "min": 9, "max": 13 }. Prefer it over fontSizeEquals for cross-PDF compatibility: font metrics can vary by 1-3 points across PDFs from the same template, which is more than the 0.5-point tolerance of fontSizeEquals.
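
The difference between the two font-size conditions can be seen in a small sketch. It covers only three condition types and takes the font size directly instead of a PdfElement and RuleCondition. The Matches helper and its signature are hypothetical, and parsing with System.Text.Json is an assumption about how the JSON value might be read.

```csharp
using System;
using System.Globalization;
using System.Text.Json;

// A heading set at 11pt in the template, measured as 12.4 in another PDF.
var measured = 12.4;
Console.WriteLine(Matches(measured, "fontSizeEquals", "11"));
Console.WriteLine(Matches(measured, "fontSizeRange", "{ \"min\": 9, \"max\": 13 }"));

static bool Matches(double fontSize, string type, string value)
{
    switch (type)
    {
        case "fontSizeEquals":
            // Equality with a 0.5-point tolerance, as in the table above.
            return Math.Abs(fontSize - double.Parse(value, CultureInfo.InvariantCulture)) <= 0.5;
        case "fontSizeRange":
            // The value is a JSON object with numeric "min" and "max" properties.
            using (var doc = JsonDocument.Parse(value))
            {
                var min = doc.RootElement.GetProperty("min").GetDouble();
                var max = doc.RootElement.GetProperty("max").GetDouble();
                return fontSize >= min && fontSize <= max;
            }
        case "fontSizeAbove":
            return fontSize > double.Parse(value, CultureInfo.InvariantCulture);
        default:
            // Condition types not covered by this sketch never match.
            return false;
    }
}
```

The measured size is 1.4 points away from 11, so fontSizeEquals fails while the 9-13 range still matches.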

Registration

Registered via UpDocComposer as a scoped service.