New Module: XPATH Fetch Page

We are introducing a new module, the XPATH Fetch Page.

We will also deprecate the Fetch Page module at the end of June, so please convert any existing Pipes that use it to the XPATH Fetch Page module.

To use the XPATH Fetch Page module, first enter the URL of the page you want to fetch. By default, the module outputs the DOM elements as items in the preview pane. You can optionally check the “Emit items as string” checkbox if you need the HTML as a string.

You can use the “Extract using XPATH” field to fine-tune what you need from the HTML page. For example, if I want all the links in the page, I can simply use “//a”. If I want all the images, I can use “//img”. Read more on XPath. You can also use Firebug or your browser's developer tools to find XPath expressions that target the data you want in an HTML page.
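To make the “//a” and “//img” expressions concrete, here is a minimal sketch of the same idea outside Pipes. It uses Python's standard library, which supports only a subset of XPath and needs well-formed markup. The sample page, the URLs and the extract helper are all made up for illustration; this is not how the module itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed page; a real page would need tidying first.
PAGE = """<html>
  <body>
    <p>See <a href="http://example.com/one">one</a> and
       <a href="http://example.com/two">two</a>.</p>
    <img src="logo.png" alt="Logo"/>
  </body>
</html>"""

def extract(markup, path):
    """Return every element matching an expression such as '//a'."""
    root = ET.fromstring(markup)
    # ElementTree only accepts relative paths, so '//a' becomes './/a'.
    if path.startswith("//"):
        path = "." + path
    return root.findall(path)

links = [a.get("href") for a in extract(PAGE, "//a")]
images = [img.get("src") for img in extract(PAGE, "//img")]
print(links)   # the two link targets
print(images)  # the single image source
```

Each matched element corresponds roughly to one item in the preview pane; attributes such as href and src are then available on each item.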

Previously, with the older Fetch Page module, you had to wrestle with regexes, splits and other complicated methods to get the data you wanted. The new XPATH Fetch Page module is more powerful, easier to use and more in line with today's standards.

Currently, this module fetches the page and fixes malformed tags using Tidy. The parser runs with HTML4 support by default; check the “Use HTML5 parser” checkbox to use the HTML5 parser instead. We recommend the HTML5 parser in most cases.

Click here for an example Pipe using the new XPATH Fetch Page module.

You can also use Pipes' special variable substitution syntax (e.g. ${<DOM node path here>}) to construct new content from the DOM nodes. For example:

${td.0.span.0.a.content} will pull the content from that DOM path, as shown in the preview pane.

and a longer example:

Company name / Ticker: ${td.0.a.content} ${td.0.p}<br>Underwriter: ${td.1.p.content.0} ${td.1.p.content.1}<br>Price Range: ${td.2.p.content}<br>Shares: ${td.3.p}<br>Pricing Date: ${td.4.p}
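The dotted paths above read as a walk through nested items: names select a child node and numbers select a position in a list. The following sketch shows that reading in Python. The ITEM data and the lookup and substitute helpers are hypothetical, and how Pipes treats a missing path is not covered here (this sketch simply raises an error).

```python
import re

# Hypothetical item, shaped like the nested structure shown in the preview pane.
ITEM = {
    "td": [
        {"a": {"content": "Example Corp"}, "p": "EXMP"},
        {"p": {"content": ["First Bank", "Second Bank"]}},
    ]
}

def lookup(item, path):
    """Walk a dotted path; segments index into lists or select dict keys."""
    node = item
    for key in path.split("."):
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node

def substitute(template, item):
    """Replace each ${path} in the template with the value found at that path."""
    return re.sub(
        r"\$\{([^}]+)\}",
        lambda match: str(lookup(item, match.group(1))),
        template,
    )

line = substitute("Company name / Ticker: ${td.0.a.content} ${td.0.p}", ITEM)
print(line)
print(substitute("Underwriter: ${td.1.p.content.0}", ITEM))
```

With the sample data, the first template resolves td.0.a.content to the company name and td.0.p to the ticker.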

You can use the Regex module to help you build this new content. View an example that uses this method here.

Note on usage: The module will only fetch HTML pages under 1.5MB, and the page must also be indexable (i.e. allowed by the site’s robots.txt file). If you do not want your page made available to this module, please add it to your robots.txt file, or add the following tag to the page’s <head> element:

<meta name="robots" content="noindex">
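
The two fetch conditions in the note above (allowed by robots.txt and under 1.5MB) can be sketched with Python's standard robots.txt parser. The user agent name, the sample robots.txt and the may_fetch helper are hypothetical, and treating 1.5MB as 1.5 × 1024 × 1024 bytes is an assumption. The sketch does not check the noindex meta tag.

```python
from urllib.robotparser import RobotFileParser

MAX_BYTES = int(1.5 * 1024 * 1024)  # assumption: 1.5MB counted in binary units
USER_AGENT = "ExampleFetcher"       # hypothetical; the module's real user agent is not given here

def may_fetch(robots_txt, url, page_size):
    """True if robots.txt allows the URL and the page is under the size limit."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url) and page_size < MAX_BYTES

# Hypothetical robots.txt that blocks everything under /private/.
ROBOTS = """User-agent: *
Disallow: /private/
"""

print(may_fetch(ROBOTS, "http://example.com/index.html", 20_000))
print(may_fetch(ROBOTS, "http://example.com/private/a.html", 20_000))
print(may_fetch(ROBOTS, "http://example.com/index.html", 2_000_000))
```

Only the first call passes both checks: the second URL is disallowed by the sample robots.txt, and the third page exceeds the size limit.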