Converting HTML to Markdown with Upmark

  • Written by: Josh Bassett, Data Platform Technical Lead, The Conversation

Here at The Conversation we run a Job Board, which requires parsing a whole bunch of job descriptions in HTML and converting them to Markdown. When we originally built the Job Board we looked around for an HTML-to-Markdown converter library written in Ruby, but we couldn’t find one, so we built our own: Upmark.

Upmark allows you to easily convert HTML documents to Markdown format:

require "upmark"
html = "<p>messenger <strong>bag</strong> skateboard</p>"
markdown = Upmark.convert(html)
puts markdown
# prints: messenger **bag** skateboard

It handles most HTML tags, and anything that can’t be converted to Markdown is passed through as HTML.
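
To make the pass-through behaviour concrete, here is a minimal sketch in plain Ruby. It is not Upmark's code: the node shape (a String for text, a Hash with :name and :children for an element), the WRAPPERS table and the to_markdown method are all hypothetical names chosen for illustration. Tags the sketch knows about are wrapped in Markdown; anything else is written back out as HTML.

```ruby
# Hypothetical node shape for this sketch (not Upmark's internal format):
#   a String is text; a Hash is { name: "tag", children: [nodes] }.

# Markdown prefix/suffix for the few tags this sketch knows how to convert.
WRAPPERS = {
  "strong" => ["**", "**"],
  "em"     => ["*", "*"],
  "p"      => ["", "\n\n"]
}.freeze

def to_markdown(node)
  return node if node.is_a?(String)

  inner = node[:children].map { |child| to_markdown(child) }.join
  if WRAPPERS.key?(node[:name])
    prefix, suffix = WRAPPERS[node[:name]]
    "#{prefix}#{inner}#{suffix}"
  else
    # No Markdown equivalent in this sketch: re-emit the element as HTML.
    "<#{node[:name]}>#{inner}</#{node[:name]}>"
  end
end

tree = {
  name: "p",
  children: [
    "messenger ",
    { name: "strong", children: ["bag"] },
    " ",
    { name: "sub", children: ["skateboard"] }
  ]
}

puts to_markdown(tree)
```

Here the p and strong elements become Markdown, while the sub element, which the sketch has no mapping for, survives as <sub>skateboard</sub>.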

How does it work?

Upmark does all the heavy lifting with a parser and a transformer built on the excellent Parslet library. Parslet allows you to define a grammar in plain Ruby, which is then used to parse a document into a syntax tree. The syntax tree can then be arbitrarily transformed; in our case it is transformed into a Markdown document.
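
The two phases can be sketched without any dependencies. What follows is an illustration only, assuming a tiny HTML subset (text and attribute-free elements). It uses Ruby's standard StringScanner in place of a Parslet grammar, and parse_nodes and transform are hypothetical names rather than anything from Upmark or Parslet.

```ruby
require "strscan"

# Phase 1: parse text and attribute-free elements into nested hashes.
# A hand-rolled stand-in for a grammar, not Upmark's parser.
def parse_nodes(scanner)
  nodes = []
  until scanner.eos? || scanner.check(%r{</})
    if scanner.scan(/<(\w+)>/)
      name = scanner[1]
      children = parse_nodes(scanner)
      scanner.scan(%r{</#{name}>}) or raise "expected </#{name}>"
      nodes << { element: { name: name, children: children } }
    else
      text = scanner.scan(/[^<]+/) or raise "unsupported input at #{scanner.pos}"
      nodes << { text: text }
    end
  end
  nodes
end

# Phase 2: walk the tree and emit Markdown.
def transform(nodes)
  nodes.map do |node|
    next node[:text] if node.key?(:text)

    name  = node[:element][:name]
    inner = transform(node[:element][:children])
    case name
    when "strong" then "**#{inner}**"
    when "p"      then inner
    else "<#{name}>#{inner}</#{name}>" # unknown tag: keep it as HTML
    end
  end.join
end

html = "<p>messenger <strong>bag</strong> skateboard</p>"
tree = parse_nodes(StringScanner.new(html))
markdown = transform(tree)
puts markdown
```

The intermediate tree is plain data, so each phase can be inspected and tested on its own.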

The whole process looks something like this:

[Image: Author provided]

Parse it!

The first phase of the process is parsing the input into a syntax tree.

To parse an HTML document we first need to define a grammar. A grammar contains the individual rules for parsing the different parts of a document. Rules for parsing simpler elements can be combined to parse more complex structures.

Parslet provides us with the Parslet::Parser class, which we extend to define our parser:

class MyParser < Parslet::Parser
  # all the rules go here
end

In the case of Upmark, we first define rules for parsing the more complex parts of an HTML document, like an element. The rule for parsing an element is then decomposed into rules for parsing tags and attributes. These rules are then further broken down into combinations of simpler rules for text, numbers, and whitespace.

Consider the following snippet of HTML:

<p>hello world!</p>
<img src="lol.gif" />
<ol>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ol>

This document is just a series of HTML elements, so the first rule we define might be:

rule(:element) do
  start_tag.as(:start_tag) >>  # e.g. "<p>"
  children.as(:children) >>
  end_tag.as(:end_tag)         # e.g. "</p>"
end

This rule says that in order to parse an element we need a start_tag, some children, and finally an end_tag. The as modifiers define how they are labelled in the resulting syntax tree.

Okay, so now what? Let’s break it down further and add the next rule to our parser.

To parse a start_tag we need a < character, a name, zero or more attributes (each preceded by whitespace), some optional whitespace, and finally a > character.

rule(:start_tag) do
  str('<') >>
  name.as(:name) >>
  (space >> attribute).repeat.as(:attributes) >>
  space? >>
  str('>')
end

According to the XML spec, a name is just a string limited to a particular range of characters:

rule(:name) do
  match(/[a-zA-Z_:]/) >> match(/[\w:\.-]/).repeat
end

Here are the rules for parsing whitespace:

rule(:space)  { match(/\s/).repeat(1) }
rule(:space?) { space.maybe }

I’ll leave defining the rules for parsing children and attributes as an exercise for the reader (or you can cheat and just look in the Upmark source code).

Finally, this is how we apply our parser to the input:

tree = MyParser.new.parse(html)

Once our parser is applied to a document, a syntax tree is generated.

Transform it!

The second phase of the whole process is to transform the syntax tree into some desired output.

Parslet syntax trees are represented as an array of nested hashes. For example:

tree = [
  {
    element: {
      name: "img",
      attributes: [{name: "src", value: "http://example.com/lol.gif"}],
      children: []
    }
  }
]

Given the above syntax tree, let’s write a transform which traverses the syntax tree and converts it to Markdown.

Again, Parslet makes transforming easier for us by providing the Parslet::Transform class to extend:

class MyTransform < Parslet::Transform
  rule(
    element: {
      name: "img",
      attributes: subtree(:attributes),
      children: subtree(:children)
    }
  ) do |img|
    # The tree above uses symbol keys, so look attributes up by symbol.
    src = img[:attributes].find { |attribute| attribute[:name] == "src" }[:value]
    "![](#{src})"
  end
end

The MyTransform transform matches an img element with subtrees of attributes and children (a Parslet hash pattern only matches a hash when it names every one of its keys). It then plucks out the src attribute and returns the Markdown for an image.

This is how we apply the transform to the syntax tree:

markdown = MyTransform.new.apply(tree)
puts markdown
# prints: ![](http://example.com/lol.gif)

Turtles all the way down

So how did we write a parser that converts an entire HTML document to Markdown? The answer is simple: it’s turtles all the way down.

By combining multiple rules and transforms, we can break a big problem down into a series of smaller problems.

Hopefully this gives you some insight into how to write your own parser using Parslet, and if you happen to need a handy HTML-to-Markdown converter then please check out Upmark.
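
For readers who want a head start on the attribute exercise, here is one possible shape for it, written as a dependency-free sketch with Ruby's StringScanner rather than as Parslet rules. The method names parse_attribute and parse_start_tag are hypothetical. The sketch assumes double-quoted attribute values only and, unlike the start_tag rule above, also accepts a trailing "/" so that a tag like <img ... /> parses.

```ruby
require "strscan"

# Same character ranges as the article's name rule.
NAME = /[a-zA-Z_:][\w:.-]*/

# attribute: name = "value". Returns nil (and backtracks) on no match.
def parse_attribute(scanner)
  start = scanner.pos
  name = scanner.scan(NAME)
  if name && scanner.scan(/\s*=\s*"([^"]*)"/)
    { name: name, value: scanner[1] }
  else
    scanner.pos = start
    nil
  end
end

# start_tag: "<" name (space attribute)* space? "/"? ">"
def parse_start_tag(scanner)
  start = scanner.pos
  unless scanner.scan(/</) && (name = scanner.scan(NAME))
    scanner.pos = start
    return nil
  end

  attributes = []
  loop do
    before = scanner.pos
    attribute = scanner.scan(/\s+/) && parse_attribute(scanner)
    unless attribute
      scanner.pos = before # undo the whitespace we consumed
      break
    end
    attributes << attribute
  end

  scanner.scan(/\s*/)
  unless scanner.scan(%r{/?>})
    scanner.pos = start
    return nil
  end
  { name: name, attributes: attributes }
end

tag = parse_start_tag(StringScanner.new('<img src="lol.gif" alt="lol" />'))
p tag
```

Each method mirrors one grammar rule and restores the scanner position when it fails, which is the same compose-and-backtrack idea a Parslet grammar gives you for free.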

Read more http://theconversation.com/converting-html-to-markdown-with-upmark-65788
