Migrating from WordPress to Pelican with Python: Building HTML to Markdown Extraction Tools

Recently, I finished migrating this blog from WordPress to Pelican, a Python-based static site generator. Instead of manually converting dozens of posts, I built a collection of Python scripts to automate the extraction and conversion process. The result is a reusable tool that I've open-sourced for anyone facing a similar migration.

The Challenge

WordPress exports content as HTML with embedded metadata, while Pelican expects Markdown files with YAML front matter. The structural differences meant I needed to:

  1. Extract metadata (title, date, category, tags) from WordPress HTML
  2. Convert HTML content to clean Markdown
  3. Generate SEO-friendly slugs, which I have set to match my filenames
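As a rough sketch of step 3, slug generation fits in a few lines of standard-library Python. The function name `slugify` and the exact rules (ASCII folding, lowercasing, hyphen separators) are illustrative assumptions, not necessarily what the published scripts do.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a post title into a lowercase, hyphen-separated ASCII slug."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower())
    return slug.strip("-")


print(slugify("Migrating from WordPress to Pelican with Python"))
# migrating-from-wordpress-to-pelican-with-python
```

Using the same function for both the slug and the filename keeps the two in sync.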

The Solution: Three Focused Scripts

Rather than building one monolithic converter, I created three specialized scripts that can work independently or together:

1. Metadata Extraction (extract_html_metadata.py)

This script parses WordPress HTML to extract structured metadata:

# Extracts and formats as Pelican front matter
python extract_html_metadata.py wordpress-post.html

It looks for specific WordPress elements like <time class="entry-date published"> for dates and <a rel="category tag"> for taxonomies, then formats them as clean YAML front matter.
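To make the idea concrete, here is a minimal sketch of that kind of extraction. It is not the code from the repository: it uses the standard library's `html.parser` instead of Beautiful Soup so it runs without dependencies, and the class name `MetadataParser`, the helper `format_front_matter`, the `<h1 class="entry-title">` title selector, and the `rel="tag"` tag selector are all assumptions. The output uses the `Key: value` header lines that Pelican's Markdown reader understands; the real script's output may differ.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect title, date, category and tags from a WordPress post page."""

    def __init__(self):
        super().__init__()
        self.meta = {"title": "", "date": "", "category": "", "tags": []}
        self._target = None  # field that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        rel = (attrs.get("rel") or "").split()
        if tag == "h1" and "entry-title" in classes:  # assumed title markup
            self._target = "title"
        elif tag == "time" and {"entry-date", "published"} <= classes:
            # Keep only the YYYY-MM-DD part of the datetime attribute.
            self.meta["date"] = (attrs.get("datetime") or "")[:10]
        elif tag == "a" and "category" in rel:
            self._target = "category"
        elif tag == "a" and rel == ["tag"]:  # assumed tag markup
            self._target = "tags"

    def handle_endtag(self, tag):
        self._target = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._target is None:
            return
        if self._target == "tags":
            self.meta["tags"].append(text)
        elif not self.meta[self._target]:
            # Keep the first match; Pelican allows one category per post.
            self.meta[self._target] = text
        self._target = None


def format_front_matter(meta):
    """Render the collected metadata as Key: value header lines."""
    return "\n".join([
        f"Title: {meta['title']}",
        f"Date: {meta['date']}",
        f"Category: {meta['category']}",
        f"Tags: {', '.join(meta['tags'])}",
    ])


sample = """
<article>
  <h1 class="entry-title">Hello Pelican</h1>
  <time class="entry-date published" datetime="2024-03-01T09:30:00+00:00">March 1, 2024</time>
  <a href="/category/python/" rel="category tag">Python</a>
  <a href="/tag/pelican/" rel="tag">pelican</a>
  <a href="/tag/migration/" rel="tag">migration</a>
</article>
"""
parser = MetadataParser()
parser.feed(sample)
print(format_front_matter(parser.meta))
```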

2. Content Conversion (extract_html_body.py)

This script focuses solely on extracting and converting the main article content:

# Converts HTML content to Markdown
python extract_html_body.py wordpress-post.html

This script isolates the <div class="entry-content"> section and converts its HTML to clean Markdown by walking the parsed document rather than pattern-matching on raw text.
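The sketch below shows the general shape of such a conversion. It is a simplified stand-in, not the repository's implementation: it relies on the standard library's `html.parser` rather than Beautiful Soup, handles only headings, paragraphs, bold, italics and links, and the class name `EntryContentToMarkdown` is made up for this example.

```python
import re
from html.parser import HTMLParser


class EntryContentToMarkdown(HTMLParser):
    """Convert the contents of <div class="entry-content"> to Markdown."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # <div> nesting inside entry-content; 0 means outside
        self.parts = []   # Markdown fragments collected so far
        self.href = ""    # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.depth or "entry-content" in classes:
                self.depth += 1
            return
        if not self.depth:
            return
        if tag in ("h1", "h2", "h3"):
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag == "a":
            self.href = attrs.get("href") or ""
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif not self.depth:
            return
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag == "a":
            self.parts.append(f"]({self.href})")

    def handle_data(self, data):
        if self.depth:
            # Collapse whitespace runs the way a browser would.
            self.parts.append(re.sub(r"\s+", " ", data))

    def markdown(self):
        text = "".join(self.parts)
        text = re.sub(r"[ \t]+\n", "\n", text)   # drop trailing spaces
        text = re.sub(r"\n[ \t]+", "\n", text)   # drop leading spaces
        text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line
        return text.strip()


sample = """
<div class="entry-content">
  <h2>Why Pelican?</h2>
  <p>Static sites are <strong>fast</strong> and <em>simple</em>.</p>
  <p>See the <a href="https://getpelican.com">Pelican site</a>.</p>
</div>
<div class="sidebar"><p>Ignored</p></div>
"""
converter = EntryContentToMarkdown()
converter.feed(sample)
print(converter.markdown())
```

Tracking the `<div>` depth is what keeps sidebars, comments and footers out of the result: only text seen while inside the entry-content container is collected.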

3. Complete Conversion (html_to_markdown.py)

This is the main orchestrator. It combines metadata and content extraction:

# Full conversion with automatic filename generation
python html_to_markdown.py wordpress-post.html

This creates complete Pelican-ready Markdown files in the correct directory structure with SEO-friendly filenames.
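The final assembly step might look roughly like this. Treat it as a sketch under assumptions: the function name `write_post`, the one-directory-per-category layout, and the `Slug:` header line are illustrative choices, and the repository may organize its output differently.

```python
import re
import tempfile
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase the text and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def write_post(meta: dict, body: str, content_dir: Path) -> Path:
    """Write one Pelican-style Markdown post and return its path."""
    slug = slugify(meta["title"])
    header = "\n".join([
        f"Title: {meta['title']}",
        f"Date: {meta['date']}",
        f"Category: {meta['category']}",
        f"Tags: {', '.join(meta['tags'])}",
        f"Slug: {slug}",
    ])
    # Assumed layout: one sub-directory per category, filename equals slug.
    target = content_dir / slugify(meta["category"]) / f"{slug}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(f"{header}\n\n{body}\n", encoding="utf-8")
    return target


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    meta = {
        "title": "Hello Pelican",
        "date": "2024-03-01",
        "category": "Python",
        "tags": ["pelican", "migration"],
    }
    path = write_post(meta, "First paragraph of the post.", Path(tmp))
    print(path.relative_to(tmp).as_posix())  # python/hello-pelican.md
    print(path.read_text(encoding="utf-8"))
```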

Key Technical Decisions

Beautiful Soup for Parsing: Rather than regex or basic string manipulation, I used Beautiful Soup for robust HTML parsing. This handles malformed HTML gracefully and provides reliable element selection.
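A small illustration of why a parser beats a regex here. To stay dependency-free, this sketch uses the standard library's `html.parser` in place of Beautiful Soup, but the point carries over: a parser sees attributes as data, so their order and quoting stop mattering. The `TimeFinder` class and the sample markup are invented for the example.

```python
import re
from html.parser import HTMLParser

# Same element WordPress would emit, but with the attributes reordered,
# single quotes, and the class names swapped.
html = "<time datetime='2024-03-01' class='published entry-date'>March 1</time>"

# A regex written against the most common form misses it.
pattern = re.compile(r'<time class="entry-date published" datetime="([^"]+)"')
print(pattern.search(html))  # None


class TimeFinder(HTMLParser):
    """Remember the datetime of the published entry-date element."""

    date = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if tag == "time" and {"entry-date", "published"} <= classes:
            self.date = attrs.get("datetime")


finder = TimeFinder()
finder.feed(html)
print(finder.date)  # 2024-03-01
```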

Modular Design: Splitting functionality into three scripts allows for flexible usage. You might only need metadata extraction, or want to customize the content conversion while using the existing metadata logic.

Pelican Compatibility: The output format exactly matches Pelican's expected front matter and directory structure, making the converted files immediately usable.

Installation and Usage

The tool is available on GitHub with both uv and pip installation options:

# Using uv (recommended)
git clone https://github.com/thomaslangston/simple-html-wordpress-to-pelican-markdown-extractor
cd simple-html-wordpress-to-pelican-markdown-extractor
uv sync

# Basic usage
python html_to_markdown.py my-wordpress-post.html

Lessons Learned

  1. HTML Structure Varies: WordPress themes generate different HTML structures. The scripts target common patterns but may need adjustment for custom themes.

  2. Content Cleanup: Automated conversion gets you 90% there, but manual review is still valuable for formatting edge cases and WordPress shortcodes.

  3. Batch Processing: While I built these for individual file conversion, adding batch processing capability would be a natural next step.
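For the batch-processing idea in the last point, a thin wrapper is probably all that is needed. The sketch below is hypothetical: `convert_directory` is not part of the published tool, and the `convert` argument stands in for whatever single-file conversion function you end up calling.

```python
import tempfile
from pathlib import Path
from typing import Callable


def convert_directory(src_dir: Path, convert: Callable[[Path], object]):
    """Run convert() on every .html file in src_dir.

    Returns (converted, failed): names that succeeded, and (name, error)
    pairs for files that raised, so one bad post does not stop the run.
    """
    converted, failed = [], []
    for html_file in sorted(src_dir.glob("*.html")):
        try:
            convert(html_file)
            converted.append(html_file.name)
        except Exception as exc:  # report the failure and keep going
            failed.append((html_file.name, str(exc)))
    return converted, failed


# Demo with a stand-in converter that rejects files lacking entry-content.
def fake_convert(path: Path) -> None:
    if "entry-content" not in path.read_text(encoding="utf-8"):
        raise ValueError("no entry-content div found")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "good.html").write_text('<div class="entry-content">ok</div>', encoding="utf-8")
    (src / "odd-theme.html").write_text("<main>custom theme</main>", encoding="utf-8")
    done, errors = convert_directory(src, fake_convert)
    print(done)
    print(errors)
```

Collecting failures instead of stopping at the first one fits the first two lessons: posts from a custom theme or with unusual markup end up on a short list for manual review.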

Open Source Release

The complete toolset is available at https://github.com/thomaslangston/simple-html-wordpress-to-pelican-markdown-extractor under the MIT license. The code is designed to be readable and modifiable for different WordPress configurations.

If you're facing a similar migration, these tools provide a solid foundation that you can adapt to your specific needs. The modular design means you can use just the pieces you need or extend the functionality for your particular WordPress setup.

Static site generators like Pelican offer compelling advantages in performance, security, and simplicity compared to dynamic WordPress sites. With the right extraction tools, the migration process becomes much more manageable.
