HTML for non-web developers

skills
beginner
HTML
JSON
XML
Published

March 28, 2024

No feedback found for this session

HTML?

  • = HyperText Markup Language
  • developed around 1990
  • since c. 1993, the standard language for writing web pages

HTML structure

  • HTML is based on tags
  • tags add information to text
  • to make a heading in HTML, we’d take our heading text, and surround it with h2 tags:
    • <h2>This is my heading</h2>
  • hundreds of tags in use

But!

  • this session isn’t going to teach you loads of tags
  • for web developers, yes, do learn them! (also all the css and java that are used to display and style the information in the html)
  • but for everyone else - it’s being able to pick out the tag-based structure that’s useful
  • family of similar languages (XML, JSON) that operate in a similar way

Why do this?

  • HTML is mainly used to write websites
  • So why bother with HTML if you don’t?

Swiss Army Knife

image: publicdomainvectors.org

What makes HTML generically useful?

  • tags lend themselves to semantic markup
  • Say what an element is, rather than how it should look Word’s styles are a nice example of semantic markup

…and not

But random formatting in Word isn’t semantic

Where might you find HTML?

(not surprising)

(possibly surprising)

Make some HTML

Task
  • Create a text file in Windows Notepad (or other text editor)
  • add some text
  • save that file with the .html file extension
  • open in your web browser

What does it look like?

HTML doesn’t respect whitespace, formatting, etc

Add some tags

  • reminder: tags are how we add information to the text
Task
  • add some line-breaks with </br>
  • add a horizontal line with <hr>

Paired tags

  • most HTML tags are paired, with a <start> and </stop> component
Task
  • add a heading with <h1>Heading</h1>
  • add a link with <a href="URL">Link</a>

Images

  • we can also put extra information inside the tag itself to control how our HTML works. Images are a nice example. You start with a simple <img> tag, and then add components. For example, you’ll nearly always supply the web address of your image:

<img src="https://i.imgur.com/OpmMr44.jpg">

You can also resize the image using HTML, by adding height/width info to the tag:

<img src="https://i.imgur.com/OpmMr44.jpg" height="300">

And you can play with the alignment, padding, etc etc:

<img src="https://i.imgur.com/OpmMr44.jpg" height="300" ALIGN="left" HSPACE="15" VSPACE="25"/>

(this goes a bit wonky in the quarto we use to build the training pages - definitely worth trying out yourself in your HTML file)








## Tables

Tables are built by using several tags in concert. Say you want to re-build this bit of KIND network archaeology from a couple of years ago:

Webinar Date
Nov 2022 (R) 2022-11-16, 2-3pm
Dec 2022 (KIND conference) 2022-12-07, 1-5pm

You’ll need a few different tags:

  • <table> to start the whole thing off
  • <thead> to write the (different looking) header row
  • <tbody> to contain the body of the table
  • <tr> to describe what should go in a row
  • <th> to describe individual values

The trick is that they are nested together, so for a one-cell table with a header you might go like this:

<table><tbody><tr><th>Bit of content</th></tr></tbody></table>
Bit of content

Back to our proper example, following the same idea about nesting:

<table>
  <thead>
    <tr class="header">
      <th align="left">Webinar</th>
      <th align="left">Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td align="left">Nov 2022 (R)</td>
      <td align="left">2022-11-16, 2-3pm</td>
    </tr>
      <td align="left">Dec 2022 (KIND conference)</td>
      <td align="left">2022-12-07, 1-5pm</td>
    </tr>
  </tbody>
</table>
Webinar Date
Nov 2022 (R) 2022-11-16, 2-3pm
Dec 2022 (KIND conference) 2022-12-07, 1-5pm

Practical example: Data munging

  1. (the hard bit) make/find an HTML version of your document
    1. Take a lot of links in a document
    2. Save the Word doc as html and tidy it up
  2. Open that webpage source using Notepad
  3. Paste html code into Excel
  4. Delete the columns you don’t need. (try F5, selecting ‘Special’, then ‘Blanks’, then Ctrl+-)
  5. Use Text to Columns tool to remove html bits (split on ")

Practical example: parsing HTML to extract data

  • you often want to pull data out of HTML-based sources, like webpages. We’d call that parsing HTML. Here’s a mini-example in R:
library(xml2)

"<p>This is some text. This is <b>bold!</b></p>" |>
  read_html() |>
  xml_text(trim = TRUE) 
[1] "This is some text. This is bold!"
  • the advice here is to use a proper tool, rather than trying to dig all the content out of the tags yourself by e.g. finding and replacing
  • definitely worth reading this in-depth and entertaining thread about the issue: “You can’t parse [X]HTML with regex”
here::here("skills/data/doc_links.htm") |>
  xml2::read_html() |>
  xml2::xml_find_all("//@href") |>
  xml2::as_list() |>
  purrr::flatten() |>
  unlist() |>
  sample(5) # just a random sample of five to avoid overload
[1] "https://587.example.com" "https://170.example.com"
[3] "https://437.example.com" "https://166.example.com"
[5] "https://661.example.com"