Benji Fisher
January 31, 2020
Hook 42
drupal.org: benjifisher
twitter: @benji17fisher
Find a link to this presentation on my GitLab Pages:
A robust, flexible tool.
What about unstructured text with HTML markup?
Our approach: we wrote new Migrate process plugins in Migrate Plus for Pega Systems/Isovera.
HTML
↓
Regular expression
↓
HTML or extract strings
Extract the URL from
<a href="https://www.drupal.org">Drupal home page</a>
Complications:
a or A
other attributes: class, id, name, …
Is an escaped quote (\") allowed in a URL?
Trick question: do not reinvent the wheel!
The same complications, by example:
<a href="https://www.drupal.org">Drupal home page</a> (original)
<A href="https://www.drupal.org">Drupal home page</A>
<a class="ext-link" href=...
<a href='https://www.drupal.org'>Drupal home page</a>
<a href="https://www.dr\"upal.org">Drupal home page</a>
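The complications above can be reproduced in a few lines. This is a minimal sketch in Python rather than PHP, using only the standard library. The pattern NAIVE, the helper names, and the completed class="ext-link" sample are illustrative assumptions, not taken from the talk.

```python
import re
from html.parser import HTMLParser

# Hypothetical naive pattern: lowercase tag, href as the first attribute,
# double quotes only.
NAIVE = re.compile(r'<a href="([^"]*)">')

# Variants of the same link; the third sample is completed here for
# illustration (the slide truncates it).
SAMPLES = [
    '<a href="https://www.drupal.org">Drupal home page</a>',
    '<A href="https://www.drupal.org">Drupal home page</A>',
    '<a class="ext-link" href="https://www.drupal.org">Drupal home page</a>',
    "<a href='https://www.drupal.org'>Drupal home page</a>",
]


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


def regex_hrefs(html):
    """Extract URLs with the naive regular expression."""
    return NAIVE.findall(html)


def parser_hrefs(html):
    """Extract URLs with an HTML parser."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


for sample in SAMPLES:
    print(regex_hrefs(sample), parser_hrefs(sample))
```

The regular expression only matches the original form, while the parser returns the URL for all four samples without any special handling.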
From StackOverflow:
I need to match all of these opening tags:
But not these:
The answer:
You can’t parse [X]HTML with regex. Because HTML can’t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. …
Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. …
Have you tried using an XML parser instead?
HTML
↓
Document Object Model (DOM)
↓
HTML or extract strings
PHP’s DOM extension uses GNOME’s libxml2 library in the background. It includes XML Path Language (XPath) support for traversing a document.
With $xpath->query($selector), …
$selector | Matches |
---|---|
//a | all <a> elements |
//a[@class="external"] | all <a> elements with class="external" |
//li[@class="nav"]/a | all <a> elements that are direct children of <li class="nav"> |
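The three selectors in the table can be tried out on a small document. This sketch uses Python's xml.etree.ElementTree as a stand-in for PHP's $xpath->query(). ElementTree implements only a subset of XPath and expects relative paths, so each selector is written with a leading dot. The sample markup and the helper hrefs() are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree parses XML, not
# arbitrary HTML).
SNIPPET = """
<div>
  <ul>
    <li class="nav"><a href="/about">About</a></li>
    <li class="nav"><span><a href="/deep">Nested</a></span></li>
    <li><a class="external" href="https://www.drupal.org">Drupal</a></li>
  </ul>
</div>
""".strip()

root = ET.fromstring(SNIPPET)


def hrefs(selector):
    """Return the href of every element matched by the selector."""
    return [element.get("href") for element in root.findall(selector)]


# All <a> elements, at any depth.
all_links = hrefs(".//a")

# <a> elements whose class attribute equals "external" (note the @).
external = hrefs(".//a[@class='external']")

# <a> elements that are direct children of <li class="nav">; the link
# nested inside <span> is not a direct child and is skipped.
nav_children = hrefs(".//li[@class='nav']/a")

print(all_links)
print(external)
print(nav_children)
```

The @ prefix matters: without it, a predicate such as [class="external"] tests a child element named class rather than the attribute.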
ANEDS/ANED[
  @s:id = ../ANEDOA[
    nc:OR/@s:ref = "ORG0"
  ]/CDR/@s:ref
]/NED/NEDQ[
  ../NEDRC/text() = "ExpeditedDenial"
]
From usdoj/foia-api on GitHub (whitespace added, tags abbreviated)
After processing, return an HTML string:
In Drupal 8, the Migrate API follows the standard Extract, Transform, Load (ETL) structure: source plugins extract, process plugins transform, and destination plugins load.
The Transform/process phase is the right place to handle HTML processing.
HTML
↓
Migrate dom* process plugins
↓
HTML
Four process plugins in the Migrate Plus module:
dom
dom_str_replace
dom_migration_lookup
dom_apply_styles
Goal: make it easy to process text fields with proper HTML parsing.
dom plugin
dom_str_replace plugin
Change the subdomain during migration:
-
  plugin: dom_str_replace
  mode: attribute
  xpath: '//a'
  attribute_options:
    name: href
  search: 'documentation.example.com'
  replace: 'help.example.com'
Use str_replace() or preg_replace() on the href attribute.
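The idea behind this step (parse the markup, change one attribute on the matched elements, serialize back to a string) can be sketched outside Drupal. The Python below is an illustration of the technique only, not the plugin's implementation. The function replace_in_attribute() and the sample body are hypothetical, and the sketch assumes the fragment is well-formed with a single root element.

```python
import xml.etree.ElementTree as ET


def replace_in_attribute(markup, tag, attribute, search, replace):
    """Replace a substring in one attribute of every matching element.

    Assumes `markup` is a well-formed fragment with a single root element.
    """
    root = ET.fromstring(markup)
    for element in root.iter(tag):
        value = element.get(attribute)
        if value is not None:
            element.set(attribute, value.replace(search, replace))
    # Serialize the modified tree back to a string.
    return ET.tostring(root, encoding="unicode")


# Hypothetical body text: the old subdomain appears in a link and in prose.
body = (
    '<p>See the <a href="https://documentation.example.com/install">'
    "install guide</a> on documentation.example.com.</p>"
)

result = replace_in_attribute(
    body, "a", "href", "documentation.example.com", "help.example.com"
)
print(result)
```

Only the href changes. The same domain in the visible text is left alone, which a plain string replacement over the whole field could not guarantee.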
dom_apply_styles plugin
Search for an XPath expression. Replace with styles configured in the Editor module.
dom_migration_lookup
Like core Migrate’s migration_lookup plugin.
dom_str_replace
Just an FYI, my goto for HTML parsing has been querypath, it’s especially good if you’re dealing with old-school HTML (no </p>, etc.). ‒ mikeryan on #2958281-7
url source plugin data parsers
Masterminds\HTML5::loadHTML() -> \DOMDocument
This slide deck by Benji Fisher is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/benjifisher/slide-decks.