Welcome back to Beyond Content Strategy. This month, I want to share an important content strategy resource and two content processing tools with you.
I have relied on the federal government’s Plain Language guidelines (plainlanguage.gov) for more than a decade. Unfortunately, the website was recently shut down.
For those unfamiliar with the site, plainlanguage.gov served as the authoritative resource for writing clear, accessible content for government and public-facing websites. It provided guidelines, training materials, and examples of how to write content that serves all users—not just those with advanced reading levels or specialized knowledge. The site wasn’t just a government resource; it was a foundational reference for content strategists across all sectors who understood that clarity isn’t about “dumbing down” content—it’s about respecting users’ time and cognitive load.
As content strategists, we occupy a unique position in the web development ecosystem. We’re the ones who advocate for users while translating technical complexity into accessible experiences. We bridge the gap between engineering, design, and business objectives. Plain language principles aren’t peripheral to our work—they’re central to our responsibility as the voice for user experience in content systems architecture.
The loss of plainlanguage.gov as an active, maintained resource is concerning, but the principles and guidance haven’t disappeared. The entire site remains accessible through the Internet Archive, preserving years of best practices, training materials, and examples. Additionally, the site’s code and content live on in a public repository on GitHub, ensuring the information remains available to our community.
Why does this matter enough to lead this month’s newsletter? Because plain language represents everything we advocate for as content strategists: clarity over cleverness, user needs over organizational convenience, accessibility as a baseline rather than an enhancement. When we design content models, build taxonomies, or architect CMS systems, we’re not just organizing information—we’re creating pathways for human understanding. Plain language principles should inform every technical decision we make.
Making Your Own Personal Plain Language Reference
Here’s where things get interesting—and where we can demonstrate the kind of technical fluency that sets systems-thinking content strategists apart. The plainlanguage.gov GitHub repository isn’t just a backup; it’s a collection of markdown files that we can transform into whatever format serves our needs best.
Understanding Markdown: Your Secret Weapon
If you’re not already familiar with markdown, now is the perfect time to add it to your toolkit. Markdown is a lightweight markup language—essentially a way of formatting text using simple, readable syntax that can be converted into HTML or other formats. Instead of clicking formatting buttons or writing complex HTML tags, you use simple characters to indicate structure:
# This becomes a heading
**This becomes bold**
- This becomes a bullet point
[This becomes a link](url)
Why does this matter for content strategists? Because markdown separates content from presentation in a way that’s both human-readable and machine-processable. When you write in markdown, you’re creating structured content that can be transformed into websites, PDFs, Word documents, or any other format you need. It’s the same principle we advocate for when designing content models—create once, publish everywhere.
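To make the "content separate from presentation" idea concrete, here is a minimal sketch of a converter that turns the four markdown patterns shown above into HTML. It is an illustration only: it handles a tiny subset of markdown, the function names are my own, and a real project would use pandoc or a full markdown library instead.

```python
import html
import re


def inline(text: str) -> str:
    """Convert bold and link syntax inside a single line of text."""
    text = html.escape(text, quote=False)
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\[(.+?)\]\((.+?)\)", r'<a href="\2">\1</a>', text)
    return text


def to_html(markdown: str) -> str:
    """Convert headings, bullet points, and plain paragraphs to HTML."""
    out = []
    in_list = False
    for line in markdown.splitlines():
        line = line.strip()
        is_bullet = line.startswith("- ")
        # Close an open list as soon as a non-bullet line appears.
        if in_list and not is_bullet:
            out.append("</ul>")
            in_list = False
        if not line:
            continue
        heading = re.match(r"(#{1,6}) (.+)", line)
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{inline(heading.group(2))}</h{level}>")
        elif is_bullet:
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append(f"<li>{inline(line[2:])}</li>")
        else:
            out.append(f"<p>{inline(line)}</p>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


sample = """# This becomes a heading
**This becomes bold**
- This becomes a bullet point
[This becomes a link](https://example.com)"""
print(to_html(sample))
```

The same source text could just as easily feed a PDF or Word generator, which is the point: the markdown records structure, and the converter decides presentation.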
Markdown has also become a common working format for AI tools. Major AI assistants, including ChatGPT, Claude, Google Gemini, Microsoft Copilot, and Perplexity, typically structure their responses in markdown. When you ask one of them to format text, create documentation, or organize information, you usually get markdown back. When you feed content into AI tools for analysis or processing, markdown is a clean, reliable input format. Understanding markdown means you can read and shape what these tools produce, which becomes increasingly valuable as we integrate AI into content workflows.
The Plain Language GitHub repository uses markdown files (.md) for all its content pages. This means the content is already in a structured, portable format. We’re not dealing with a static website backup—we have the raw, reusable content itself.
Enter Pandoc: The Universal Document Converter
Pandoc is a command-line tool that converts documents between formats. Think of it as the Swiss Army knife of document conversion—it can transform markdown into Word documents, PDFs, HTML, ePub, and dozens of other formats while preserving structure and formatting.
For content strategists, pandoc is invaluable because it automates the kind of format conversion we often do manually. Need to turn a markdown content brief into a client-ready PDF? Pandoc. Want to convert documentation into a Word template? Pandoc. Need to generate HTML from structured content files? Pandoc.
I’ve been using Pandoc to convert documents from one format to another for years. When I was developing tools in FileMaker for my content strategy work, I used Pandoc to convert reports from FileMaker into Word documents. In more recent projects, I’ve used it alongside Python scripts to turn their output into client-friendly Word, Excel, and PDF documents.
Creating Your Personal Plain Language Reference Guide
Let’s walk through how to use the Plain Language GitHub repository with pandoc to create a comprehensive reference document you can keep on your desktop, share with your team, or customize for your organization’s needs.
What you’ll need:
- Git (to download the repository)
- Pandoc (to convert the markdown files)
- A terminal/command prompt (we’ll keep this simple)
Step 1: Clone the Repository
Open your terminal (Mac) or command prompt (Windows) and navigate to where you want to store the files:
git clone https://github.com/GSA/plainlanguage.gov.git
cd plainlanguage.gov
If you don’t have git installed or prefer not to use the command line, you can also visit the GitHub page and click the green “Code” button, then select “Download ZIP.” Extract the files to a folder on your computer.
Step 2: Explore the Markdown Files
Navigate to the _pages folder in the repository. This is where the main content lives. Open any .md file in a text editor (even Notepad works) to see the markdown format. Notice how readable it is—that’s the power of markdown. You’re looking at structured content that’s both human-readable and machine-processable.
Step 3: Install Pandoc
Visit https://pandoc.org/installing.html and follow the installation instructions for your operating system.
- Mac users: The easiest method is using Homebrew: brew install pandoc
- Windows users: Download the installer from the pandoc website
To verify installation, type pandoc --version in your terminal.
Step 4: Convert Files to Your Preferred Format
Now comes the magic. Here’s how to convert a single markdown file to different formats:
To create a Word document:
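pandoc _pages/guidelines/words/use-simple-words-phrases.md -o plain-language-simple-words.docx
To create a PDF (pandoc needs a PDF engine installed for this; by default it looks for LaTeX’s pdflatex):
pandoc _pages/guidelines/words/use-simple-words-phrases.md -o plain-language-simple-words.pdf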
To combine multiple files into one comprehensive document:
pandoc _pages/guidelines/*.md -o plain-language-complete-guide.docx
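One caveat with the wildcard: *.md only matches files sitting directly in that folder, so anything in subfolders is skipped. If you would rather gather every markdown file beneath a folder, a short Python script can collect them and hand the list to pandoc. This is a rough sketch, not the only way to do it; the function names are mine, and the folder and output names simply mirror the command above.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def collect_markdown(root: Path) -> List[Path]:
    """Return every .md file under root, including subfolders, sorted by path."""
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.md"))


def build_pandoc_command(files: List[Path], output: str) -> List[str]:
    """Build the argument list for one pandoc run that merges all the files."""
    return ["pandoc", *[str(f) for f in files], "-o", output]


if __name__ == "__main__":
    files = collect_markdown(Path("_pages/guidelines"))
    command = build_pandoc_command(files, "plain-language-complete-guide.docx")
    if files and shutil.which("pandoc"):
        # Only call pandoc when it is installed and there is something to convert.
        subprocess.run(command, check=True)
    else:
        print("Nothing converted. Command would have been:", " ".join(command))
```

Files are merged in alphabetical path order, so rename or reorder the list if you want the sections to appear in a different sequence.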
Step 5: Customize Your Reference Document
Here’s where your content strategy skills come into play. You can:
- Combine specific sections relevant to your work (e.g., only web writing guidelines)
- Add your own organizational guidelines by creating additional markdown files
- Customize formatting using pandoc’s template system
- Create different versions for different audiences (client-facing vs. internal team reference)
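As one way to picture the "different versions for different audiences" idea, the sketch below keeps each version as a small recipe: which files go in, what the document is called, and which Word template styles it. Every file name, title, and template here is hypothetical. The pandoc options used (--toc, --metadata, --reference-doc) are standard ones, but the script only prints the commands so you can review them before running anything.

```python
from typing import List, Optional


def variant_command(sources: List[str], output: str, title: str,
                    reference_doc: Optional[str] = None) -> List[str]:
    """Build a pandoc command for one audience-specific version."""
    command = ["pandoc", *sources, "-o", output,
               "--toc", "--metadata", f"title={title}"]
    if reference_doc:
        # Borrow fonts and heading styles from an existing Word file.
        command.append(f"--reference-doc={reference_doc}")
    return command


# Hypothetical recipes: swap in your own files, titles, and templates.
VARIANTS = {
    "client": {
        "sources": ["web-writing.md", "simple-words.md"],
        "output": "plain-language-client.docx",
        "title": "Plain Language Essentials",
        "reference_doc": "client-template.docx",
    },
    "internal": {
        "sources": ["web-writing.md", "simple-words.md", "house-style.md"],
        "output": "plain-language-internal.docx",
        "title": "Plain Language and House Style",
        "reference_doc": None,
    },
}

for name, options in VARIANTS.items():
    print(name, "->", " ".join(variant_command(**options)))
```

Keeping the recipes in one place means adding a third audience is a matter of adding one more entry rather than remembering another long command.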
Advanced Pandoc Techniques for Content Strategists
Once you’re comfortable with basic conversion, pandoc offers powerful options:
Add a table of contents:
pandoc input.md -o output.docx --toc
Apply the styles from your own Word template:
pandoc input.md -o output.docx --reference-doc=my-template.docx
Combine multiple markdown files into one PDF with a table of contents and numbered sections:
pandoc chapter1.md chapter2.md chapter3.md -o complete-guide.pdf --toc --number-sections
Why This Matters Beyond Plain Language
This exercise isn’t just about preserving access to Plain Language guidelines—though that alone makes it worthwhile. It’s about understanding how content strategists can leverage technical tools to create, manage, and transform content at scale.
When you understand markdown and pandoc, you can:
- Create documentation systems that output to multiple formats from a single source
- Build style guides that live as structured markdown files and generate branded PDFs
- Develop content templates that writers can use in any editor
- Automate content transformation between systems during migrations
- Maintain version-controlled content that tracks changes over time
This is exactly the kind of technical fluency that positions content strategists as systems architects rather than just content creators. You’re not just organizing information—you’re building flexible, scalable content systems that adapt to changing needs.
A Living Reference You Control
By creating your own Plain Language reference document, you’ve done more than preserve important guidelines. You’ve demonstrated a core principle of modern content strategy: content should be structured in ways that make it reusable, transformable, and sustainable. You’ve taken content from one context (a government website) and adapted it to serve your specific needs (a desktop reference, team training document, or client resource).
This is the kind of thinking we’ll explore throughout this issue as we look at real projects and practical tools. The technical skills you’re building—whether it’s using pandoc, writing in markdown, or understanding content structure—compound over time to make you exponentially more valuable to your organization and clients.
Quick Win: Spend 30 minutes this week downloading the Plain Language repository and converting one section to PDF. Choose a section you reference frequently in your work. Now you have a portable, offline reference you can share with colleagues or reference during client meetings.
From Basic Extraction to Intelligent Document Processing: Meet DocStrange
SIDEBAR: Installing Python for DocStrange
To use DocStrange locally, you’ll need Python installed on your computer. If you already have Python 3.8 or later, you’re all set. If not, here’s where to start:
Check if you have Python: Open your terminal (Mac) or command prompt (Windows) and type:
python --version
or
python3 --version
If you see a version number 3.8 or higher, you’re ready to go. If not, you’ll need to install Python.
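If you have more than one Python on your machine, it can be unclear which one a command actually runs. A tiny script like the one below (a sketch; the 3.8 minimum comes from this sidebar, and the function name is my own) reports whether the interpreter running it is new enough.

```python
import sys

MINIMUM = (3, 8)  # minimum version mentioned in this sidebar


def meets_minimum(version, minimum=MINIMUM) -> bool:
    """Compare the major and minor version numbers against the minimum."""
    return tuple(version[:2]) >= tuple(minimum)


current = sys.version_info
status = "ready to go" if meets_minimum(current) else "too old, please upgrade"
print(f"Python {current[0]}.{current[1]} is {status}")
```

Save it as a file and run it with the same python or python3 command you plan to use for installing DocStrange.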
Installation Resources:
For Windows users:
- Official Python installer: https://www.python.org/downloads/windows/
- Detailed tutorial: https://realpython.com/installing-python/#windows
- Choose “Add Python to PATH” during installation (this is important!)
For Mac users:
- Official Python installer: https://www.python.org/downloads/macos/
- Detailed tutorial: https://realpython.com/installing-python/#macos
- Mac users can also use Homebrew: brew install python3
After installation, verify by checking the version again. Once Python is installed, you can install DocStrange with a simple pip install docstrange command.
Now that we’ve explored pandoc for converting between formats, let’s look at a specialized tool for a job pandoc cannot do: reading content out of existing PDFs and images. That tool is DocStrange.
The Document Intelligence Problem
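Pandoc can write PDFs, but it cannot read them. PDF is not one of its input formats, so a command like pandoc input.pdf -o output.md fails. For basic extraction you need a separate text extractor, such as pdftotext from the Poppler utilities:
pdftotext -layout input.pdf output.txt
For straightforward PDFs with selectable text and simple layouts, this works fine. But as content strategists, we regularly encounter documents that resist simple extraction: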
- Scanned documents with no selectable text
- Complex multi-column layouts where text extraction loses logical reading order
- Documents with tables that need structure preserved
- Forms and invoices with specific data fields we need to extract
- Legacy PDFs with embedded images, mixed layouts, and inconsistent formatting
- Image-based content like screenshots of content that needs migration
This is where basic text extraction falls short, and where DocStrange’s AI-powered document intelligence becomes invaluable.
What DocStrange Does Differently
DocStrange is a Python library developed by NanoNets that uses AI and advanced OCR to not just extract text, but understand document structure and meaning. Here’s the fundamental difference:
- Basic text extraction (for example, with pdftotext) pulls text from PDFs linearly, treating the document as a text stream
- DocStrange analyzes document layout visually, understands structure, and intelligently reconstructs content with hierarchy and meaning intact
Think of it as the difference between copying text from a PDF versus understanding what that text represents in context.
Key Capabilities for Content Strategists
1. OCR for Image-Based Content
DocStrange can extract text from scanned PDFs and images—something pandoc cannot do at all:
# Extract text from a scanned document
docstrange scanned-manual.pdf --output markdown
This alone makes it essential for content migration projects where legacy documentation exists only as scans or images.
2. Intelligent Table Preservation
When a basic text extractor meets a table in a PDF, you often get garbled text that has lost its structure. DocStrange recognizes tables visually and converts them to proper markdown tables or HTML:
from docstrange import DocumentExtractor
# Extract with table structure preserved
result = DocumentExtractor().extract("report-with-tables.pdf")
markdown = result.extract_markdown()
# Tables preserved as markdown
For content audits involving product specifications, comparison charts, or data tables, this structure preservation is critical.
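Once a table survives as markdown, it becomes easy to work with as data. The sketch below parses a simple markdown table into a list of rows you could sort, filter, or load into a spreadsheet. The sample table is made up for illustration and is not actual DocStrange output, and the parser assumes a plain pipe table with a header row and a separator row.

```python
from typing import Dict, List


def split_cells(line: str) -> List[str]:
    """Split one table line into trimmed cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(table: str) -> List[Dict[str, str]]:
    """Turn a simple markdown pipe table into a list of row dictionaries."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return []
    headers = split_cells(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at lines[2].
    return [dict(zip(headers, split_cells(line))) for line in lines[2:]]


# Hypothetical product specification table.
sample = """
| Product | Weight | Warranty |
| --- | --- | --- |
| Widget | 2 kg | 1 year |
| Gadget | 5 kg | 3 years |
"""

for row in parse_markdown_table(sample):
    print(row["Product"], "-", row["Warranty"])
```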
3. Structured Data Field Extraction
Beyond converting entire documents, DocStrange can extract specific fields from forms, invoices, and structured documents:
from docstrange import DocumentExtractor
extractor = DocumentExtractor()
# Extract specific fields from an invoice
result = extractor.extract(
    "invoice.pdf",
    extract_fields=["invoice_number", "total_amount", "due_date", "vendor_name"]
)
print(result.extract_json())
This is invaluable when auditing content systems with forms or semi-structured content that needs to be mapped to a CMS content model.
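To show what "mapping to a content model" can look like in practice, here is a minimal sketch that takes extracted fields as JSON and loads them into a typed record. The JSON sample is hypothetical and hand-written, not real DocStrange output, and the InvoiceEntry type is a stand-in for whatever content type your CMS defines.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

REQUIRED = ("invoice_number", "total_amount", "due_date", "vendor_name")


@dataclass
class InvoiceEntry:
    """Hypothetical content type with one typed field per extracted value."""
    invoice_number: str
    vendor_name: str
    total_amount: Decimal
    due_date: date


def to_entry(raw_json: str) -> InvoiceEntry:
    """Validate extracted fields and convert them to the content model."""
    data = json.loads(raw_json)
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"Missing fields: {', '.join(missing)}")
    # Strip currency symbols and thousands separators before converting.
    amount = str(data["total_amount"]).replace("$", "").replace(",", "")
    return InvoiceEntry(
        invoice_number=str(data["invoice_number"]).strip(),
        vendor_name=str(data["vendor_name"]).strip(),
        total_amount=Decimal(amount),
        due_date=date.fromisoformat(data["due_date"]),
    )


# Hypothetical extraction result.
sample = ('{"invoice_number": "INV-1042", "total_amount": "$1,250.00", '
          '"due_date": "2025-03-31", "vendor_name": "Acme Supply"}')
print(to_entry(sample))
```

The validation step is where this earns its keep: a missing or malformed field is caught at extraction time rather than discovered after import.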
4. Layout-Aware Text Extraction
DocStrange understands document layout visually, including multi-column articles, sidebars, and callout boxes, and extracts text in logical reading order. Basic extraction tools return text in the order it is stored inside the PDF, which may not match the visual layout.
5. LLM-Optimized Output
DocStrange specifically formats output to work seamlessly with AI tools. The markdown it generates is clean, hierarchical, and ready to feed into ChatGPT, Claude, or other LLMs for analysis, categorization, or content processing.
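One practical reason clean, hierarchical markdown helps with AI tools is that it is easy to split into sections and send one piece at a time. Here is a rough sketch of that idea using only the Python standard library. The function name is mine, the sample text is invented, and the splitter is deliberately simple: it treats any line starting with # as a heading, so it would misread comments inside code samples.

```python
import re
from typing import List, Tuple


def split_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split markdown into (heading, body) pairs, one pair per heading."""
    sections: List[Tuple[str, str]] = []
    title = None
    body: List[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        # Keep the section if it has a heading or any text before the first one.
        if title is not None or text:
            sections.append((title or "(no heading)", text))

    for line in markdown.splitlines():
        match = re.match(r"#{1,6}\s+(.*)", line)
        if match:
            flush()
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return sections


# Invented sample of extracted markdown.
sample = """# Returns Policy
Items can be returned within 30 days.

## Exceptions
Clearance items are final sale.
"""

for heading, text in split_by_heading(sample):
    print(f"{heading}: {len(text.split())} words")
```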
6. Multiple Output Formats
Like pandoc, DocStrange outputs to various formats:
# Convert to markdown
docstrange document.pdf --output markdown
# Extract structured data as JSON
docstrange form.pdf --output json
# Export tables to CSV
docstrange spreadsheet.pdf --output csv
# Generate HTML
docstrange report.pdf --output html
7. Privacy Options
DocStrange offers two processing modes:
- Cloud mode: Free for up to 10,000 documents per month with zero setup
- Local mode: 100% private processing on your own CPU or GPU
# Process locally for sensitive documents
docstrange confidential.pdf --cpu-mode --output markdown
For content strategists working with client materials or confidential documents, local processing is essential.
When to Use Basic Text Extraction vs. DocStrange for PDFs
Understanding when to use each tool is crucial for efficiency:
Use a basic text extractor (such as pdftotext) when:
- The PDF has clean, selectable text
- The layout is simple (single column, straightforward)
- You just need basic text content
- There are no complex tables
- You’re doing a quick one-off conversion
- The PDF is essentially a text document saved as PDF
Use DocStrange for PDF extraction when:
- The PDF is scanned or image-based (no selectable text)
- The document has multi-column or complex layouts
- You need to preserve table structure
- You’re extracting specific data fields (forms, invoices)
- The document mixes text, images, and structured data
- You’re processing many documents and need consistent output
- You need output optimized for AI processing
- Text extraction order matters (reading flow, hierarchy)
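The two checklists above boil down to a small decision rule, which can be handy to write down when you are triaging a large batch of files. The sketch below is one possible encoding. The PdfProfile fields and the returned labels are my own shorthand for the criteria in the lists, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class PdfProfile:
    """What you know about a PDF before choosing how to extract it."""
    selectable_text: bool
    complex_layout: bool = False
    needs_tables: bool = False
    needs_fields: bool = False
    large_batch: bool = False
    feeding_ai: bool = False


def recommend_tool(profile: PdfProfile) -> str:
    """Apply the checklist: any demanding requirement points to DocStrange."""
    if not profile.selectable_text:
        return "DocStrange"  # scanned or image-based, so OCR is required
    demanding = (profile.complex_layout, profile.needs_tables,
                 profile.needs_fields, profile.large_batch, profile.feeding_ai)
    return "DocStrange" if any(demanding) else "simple text extraction"


print(recommend_tool(PdfProfile(selectable_text=True)))
print(recommend_tool(PdfProfile(selectable_text=True, needs_tables=True)))
print(recommend_tool(PdfProfile(selectable_text=False)))
```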
Practical Applications for Content Strategists
- Content Audits with Legacy PDFs: Extract content from documentation that exists only as PDFs to analyze patterns, identify duplicates, or categorize using AI tools.
- Migration Projects: Convert content from systems that only export to PDF into structured formats you can map to your new CMS content model.
- Form Analysis: Extract field structures and data patterns from existing PDF forms to inform new form design in your CMS.
- Documentation Extraction: Pull structured content from user guides, product sheets, or technical documentation where you need tables and hierarchy preserved.
- Scanned Content Recovery: Process scanned historical documents that need to be migrated into modern content systems.
- Structured Data Extraction: Pull specific fields from invoices, contracts, or forms to understand content patterns and design appropriate content types.
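As a small example of the audit use case, once content is extracted to text you can flag exact duplicates without any AI at all. The sketch below fingerprints each document after normalizing whitespace and case, then groups matching files. The file names and text are invented, and it only catches copies that are identical after normalization; near-duplicates need a fuzzier comparison.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(documents: Dict[str, str]) -> List[List[str]]:
    """Return groups of document names that share the same fingerprint."""
    groups = defaultdict(list)
    for name, text in documents.items():
        groups[fingerprint(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Invented sample of extracted content.
extracted = {
    "returns-2019.md": "Items can be returned  within 30 days.",
    "returns-2021.md": "items can be returned within 30 days.\n",
    "shipping.md": "Orders ship within two business days.",
}
print(find_duplicates(extracted))
```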
Getting Started with DocStrange
Installation is straightforward:
pip install docstrange
Simplest possible use (cloud mode, no API key needed):
from docstrange import DocumentExtractor
# Extract to markdown
result = DocumentExtractor().extract("document.pdf")
print(result.extract_markdown())
For local, private processing:
docstrange document.pdf --cpu-mode --output markdown
Command-line batch processing:
# Process all PDFs in a directory
for file in *.pdf; do
  docstrange "$file" --output markdown --output-file "${file%.pdf}.md"
done
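The loop above is written for Mac and Linux shells. If you are on Windows, or simply prefer Python, something like the following sketch builds the same commands. The docstrange flags are copied from the shell example rather than independently verified, so check them against the current DocStrange documentation. The script prints each command instead of running it.

```python
from pathlib import Path
from typing import List


def batch_commands(folder: Path) -> List[List[str]]:
    """Build one docstrange command per PDF, writing report.pdf to report.md."""
    commands = []
    for pdf in sorted(folder.glob("*.pdf")):
        target = pdf.with_suffix(".md")
        commands.append(["docstrange", str(pdf),
                         "--output", "markdown",
                         "--output-file", str(target)])
    return commands


if __name__ == "__main__":
    for command in batch_commands(Path(".")):
        # Review the list first; pass each one to subprocess.run() when ready.
        print(" ".join(command))
```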
Combining Pandoc and DocStrange in Your Workflow
The real power comes from using both tools appropriately:
For the Plain Language Repository:
- Use pandoc to convert the clean markdown files to Word docs or PDFs
- The content is already well-structured markdown, so pandoc is perfect
For Legacy Content Migration:
- Use DocStrange to extract content from complex PDFs into markdown
- Then use pandoc to convert that markdown to whatever format stakeholders need
Example workflow for a content audit:
# Step 1: Extract complex PDFs with DocStrange
docstrange legacy-docs/*.pdf --output markdown --output-dir ./extracted/
# Step 2: Feed markdown to AI for analysis and categorization
# (Use Claude or ChatGPT here to process the extracted content)
# Step 3: Create formatted stakeholder report with Pandoc
pandoc ./extracted/*.md --toc --number-sections -o content-audit-report.docx
The Technical Skillset Perspective
Understanding the nuances between pandoc and DocStrange, and knowing that pandoc turns structured text into finished documents while DocStrange pulls structured text out of PDFs and images, is exactly the kind of technical fluency that distinguishes content systems architects from traditional content strategists.
You’re not just saying “convert this PDF.” You’re assessing:
- What’s the document structure and complexity?
- Is this text-based or image-based?
- Do I need table structure preserved?
- Am I processing one document or thousands?
- What format do I need for the next step in my workflow?
This systems thinking—choosing the right tool for the specific problem—is what allows you to:
- Automate tasks that would take weeks manually
- Handle complex migrations with confidence
- Scale your content auditing work efficiently
- Provide data-driven insights to stakeholders
- Position yourself as someone who solves problems systematically
These technical capabilities transform how you approach content work, moving from manual document handling to building intelligent, scalable content processing workflows.
Quick Win: Find a complex PDF in your files, something with tables, a multi-column layout, or scanned pages. Try extracting it with both a basic text extractor (such as pdftotext) and DocStrange. Compare the results. You’ll immediately see where each tool excels and understand which to reach for in different situations.