Migrate from Confluence XHTML to Asciidoctor

You can convert Atlassian Confluence XHTML pages to Asciidoctor using this Groovy script.

The script calls Pandoc to convert single or multiple HTML files exported from Confluence to AsciiDoc files. You’ll need Pandoc installed before running this script. If you have trouble running this script, you can use the Pandoc command referenced inside the script to convert XHTML files to AsciiDoc manually.

Example 1. convert.groovy
// This script is provided by melix.
// The source can be found at https://gist.github.com/melix/6020336

@Grab('net.sourceforge.htmlcleaner:htmlcleaner:2.4')
import org.htmlcleaner.*

def src = new File('html').toPath()
def dst = new File('asciidoc').toPath()

def cleaner = new HtmlCleaner()
def props = cleaner.properties
props.translateSpecialEntities = false
def serializer = new SimpleHtmlSerializer(props)

src.toFile().eachFileRecurse { f ->
    def relative = src.relativize(f.toPath())
    def target = dst.resolve(relative)
    if (f.isDirectory()) {
        target.toFile().mkdir()
    } else if (f.name.endsWith('.html')) {
        def tmpHtml = File.createTempFile('clean', 'html')
        println "Converting $relative"
        def result = cleaner.clean(f)
        result.traverse({ tagNode, htmlNode ->
                tagNode?.attributes?.remove 'class'
                if ('td' == tagNode?.name || 'th'==tagNode?.name) {
                    tagNode.name='td'
                    String txt = tagNode.text
                    tagNode.removeAllChildren()
                    tagNode.insertChild(0, new ContentNode(txt))
                }

            true
        } as TagNodeVisitor)
        serializer.writeToFile(
                result, tmpHtml.absolutePath, "utf-8"
        )
        "pandoc -f html -t asciidoc -R -S --normalize -s $tmpHtml -o ${target}.adoc".execute().waitFor()
        tmpHtml.delete()
    }/* else {
        "cp html/$relative $target".execute()
    }*/
}

This script was created by Cédric Champeau (melix). You can find the source of this script hosted at this gist.

The script is designed to be run locally on HTML files or directories containing HTML files exported from Confluence.

Usage

  1. Save the script contents to a convert.groovy file in a working directory.

  2. Make the file executable according to your specific OS requirements.

  3. Place individual files, or a directory containing files into the working directory.

  4. Run groovy convert filename.html to convert a single file.

  5. Confirm the output file meets requirements

  6. Recurse through a directory by using this command pattern: groovy convert directory/*.html