
MarkdownExtractionService.cs

Service that extracts structured sections from Markdown files using strategy-based extraction.

Reads a Markdown file from disk, splits it into lines, and applies extraction strategies defined in a SourceConfig to produce named sections. Returns the same ExtractionResult as the PDF extraction service, making source types interchangeable.

public interface IMarkdownExtractionService
{
    ExtractionResult ExtractSectionsFromConfig(string filePath, SourceConfig sourceConfig);
}

Strategy | Implementation
firstHeading | Finds the first line matching the #{level} prefix. Returns the heading text without the # prefix.
firstParagraph | Finds the first non-empty, non-heading line after a heading. Returns the text.
betweenPatterns | Scans lines, starts capturing at the startPattern regex, and stops at any of the stopPatterns regexes. Respects includeStartLine.
regex | Matches a regex pattern against each line. Returns the first match (supports capture groups).
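
The two heading-based strategies can be sketched as follows. This is a minimal illustration of the behavior described in the table, not the service's actual code: the class name HeadingStrategies, the helper HeadingLevel, and the Try-style signatures are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; names and signatures are not taken from MarkdownExtractionService.
public static class HeadingStrategies
{
    // Returns the heading level of a line (count of leading '#'), or 0 if the line is not a heading.
    public static int HeadingLevel(string line)
    {
        var trimmed = line.TrimStart();
        int level = 0;
        while (level < trimmed.Length && trimmed[level] == '#') level++;
        // Assumption: a heading is 1 to 6 '#' characters followed by a space or end of line.
        bool valid = level >= 1 && level <= 6
                     && (level == trimmed.Length || trimmed[level] == ' ');
        return valid ? level : 0;
    }

    // firstHeading: first line at exactly the requested level, returned without the '#' prefix.
    public static bool TryFirstHeading(IReadOnlyList<string> lines, int level, out string text)
    {
        foreach (var line in lines)
        {
            if (HeadingLevel(line) == level)
            {
                text = line.TrimStart().Substring(level).Trim();
                return true;
            }
        }
        text = string.Empty;
        return false;
    }

    // firstParagraph: first non-empty, non-heading line that appears after any heading.
    public static bool TryFirstParagraph(IReadOnlyList<string> lines, out string text)
    {
        bool seenHeading = false;
        foreach (var line in lines)
        {
            if (HeadingLevel(line) > 0) { seenHeading = true; continue; }
            if (seenHeading && !string.IsNullOrWhiteSpace(line))
            {
                text = line.Trim();
                return true;
            }
        }
        text = string.Empty;
        return false;
    }
}
```

Whether the real firstParagraph strategy also honors a heading level is not stated in the source, so this sketch accepts a heading of any level.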

Because Markdown is already structured text, the service needs no PdfPig, no column detection, and no font analysis. It is simple line-based text parsing (about 170 lines).
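
To make the line-based approach concrete, here is a rough sketch of the overall flow: split the text into lines, then run one rule per named section. The SectionPipeline class and the delegate-based rules are hypothetical stand-ins; the real service takes a SourceConfig and returns an ExtractionResult, whose shapes are not shown here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; the actual service works with SourceConfig and ExtractionResult.
public static class SectionPipeline
{
    // Splits Markdown text into lines, accepting both Windows (\r\n) and Unix (\n) endings.
    public static string[] SplitLines(string markdown) =>
        markdown.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

    // Runs each named rule against the lines. Rules that return null or empty are skipped.
    public static Dictionary<string, string> ExtractSections(
        string[] lines,
        IEnumerable<KeyValuePair<string, Func<string[], string>>> rules)
    {
        var sections = new Dictionary<string, string>();
        foreach (var rule in rules)
        {
            var text = rule.Value(lines);
            if (!string.IsNullOrEmpty(text)) sections[rule.Key] = text;
        }
        return sections;
    }
}
```

In a real call the text would come from disk, for example via File.ReadAllText(filePath), and each rule would wrap one of the strategies listed above.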

Strategy parameters:

  • level (int): Heading level to match (1 = #, 2 = ##, etc.). Default: 1.
  • startPattern (string): Regex to match the start line.
  • stopPatterns (string[]): Regexes that stop capturing when matched.
  • includeStartLine (bool): Whether to include the matched start line. Default: true.
  • pattern (string): Regex pattern to match.
  • flags (string): Regex flags (e.g. "i" for case-insensitive).
  • captureGroup (int): Which capture group to return. Default: 0 (full match).
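
The pattern-based parameters above could drive code along these lines. This is a sketch under assumptions, not the service's implementation: the PatternStrategies class, its method names, and the choice to map only the "i" flag are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sketch; names and signatures are not taken from MarkdownExtractionService.
public static class PatternStrategies
{
    // Assumption: only the "i" flag is mapped; other flag letters are ignored.
    public static RegexOptions ParseFlags(string flags)
    {
        var options = RegexOptions.None;
        if (!string.IsNullOrEmpty(flags) && flags.IndexOf('i') >= 0)
            options |= RegexOptions.IgnoreCase;
        return options;
    }

    // betweenPatterns: capture from the first startPattern match until any stop pattern matches.
    public static List<string> BetweenPatterns(
        IEnumerable<string> lines, string startPattern,
        IEnumerable<string> stopPatterns, bool includeStartLine = true)
    {
        var start = new Regex(startPattern);
        var stops = stopPatterns.Select(p => new Regex(p)).ToList();
        var captured = new List<string>();
        bool capturing = false;
        foreach (var line in lines)
        {
            if (!capturing)
            {
                if (start.IsMatch(line))
                {
                    capturing = true;
                    if (includeStartLine) captured.Add(line);
                }
                continue; // the start line itself is never tested against stop patterns
            }
            if (stops.Any(s => s.IsMatch(line))) break;
            captured.Add(line);
        }
        return captured;
    }

    // regex: first line that matches; returns the requested capture group (0 = full match).
    public static bool TryRegex(
        IEnumerable<string> lines, string pattern, string flags,
        int captureGroup, out string value)
    {
        var regex = new Regex(pattern, ParseFlags(flags));
        foreach (var line in lines)
        {
            var match = regex.Match(line);
            if (match.Success && captureGroup >= 0 && captureGroup < match.Groups.Count)
            {
                value = match.Groups[captureGroup].Value;
                return true;
            }
        }
        value = string.Empty;
        return false;
    }
}
```

Skipping the stop check on the start line matters when both patterns target headings: a start pattern of ^## Features and a stop pattern of ^##  would otherwise end the capture immediately.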

Registered as scoped via UpDocComposer:

builder.Services.AddScoped<IMarkdownExtractionService, MarkdownExtractionService>();

Dependencies:

  • ILogger<MarkdownExtractionService>: for logging extraction progress and errors

Namespace:

namespace UpDoc.Services;