Lambda Soup - Functional HTML Scraping for OCaml

Easy functional HTML scraping and manipulation.

Lambda Soup is an HTML data extraction and analysis library. It supports CSS selectors, DOM traversals, mutation, and HTML output. This very documentation page was generated by ocamldoc and then rewritten by Lambda Soup!

Here are some usage examples:

open Soup

let soup = read_channel stdin |> parse in

(* Print the page title. *)
soup $ "title" |> R.leaf_text |> print_endline;

(* Print the targets of all links. *)
soup $$ "a[href]"
|> iter (fun a -> print_endline (R.attribute "href" a));

(* Find the first unordered list. *)
let ul = soup $ "ul" in

(* Print the contents of all its items. *)
ul $$ "li"
|> iter (fun li ->
  trimmed_texts li |> String.concat "" |> print_endline);

(* Find all subsequent sibling elements of the same list. *)
let _ = ul $$ "~ *" in

(* Find all previous sibling elements instead. *)
let _ = ul |> previous_siblings |> elements in

(* ... *)

Lambda Soup is based around two kinds of values: nodes, which represent HTML elements, text content, and so on, and traversals, which are lazy sequences of nodes. The top-level node is the soup node (a.k.a. document node), which you typically get by calling parse on a string containing HTML.

Once you have a node, you call select on it to traverse to other nodes using CSS. There are also specialized functions, such as ancestors and previous_siblings, which allow you to traverse in directions that CSS cannot express.

Traversals can be manipulated with familiar combinators such as map, fold, and filter. They can also be terminated early.

Once you have traversed to a node you are interested in, you can extract its content or attributes, mutate it, cause other side effects, begin another traversal, or do anything else your application requires. Enjoy!

Lambda Soup is developed on GitHub and distributed under the BSD license.

This documentation page is for version 0.6.1 of the library. Documentation for other versions can be downloaded from the releases page.

Types

"Phantom" types for use with 'a node. See explanation below.

HTML nodes. These come in three varieties: element node represents a node that is known to be an element, soup node represents an entire document, and general node represents a node that might be anything, including an element, a document, or text. There is no phantom type specifically for text nodes.

Throughout Lambda Soup, if a function can operate on any kind of node, the argument is typed at 'a node. If an element node or the entire document is required, the argument type is element node or soup node, respectively. general node is the result of a function that can't guarantee that it evaluates to only elements or only documents.

Sequence of nodes. This is always instantiated as either element nodes or general nodes. The sequence is lazy in the sense that only as many elements as needed are evaluated. This can be used with with_stop to traverse part of a document until some condition is met.

High-level interface

Parses the given HTML and produces a document node. Entity references are resolved. The character encoding is detected automatically.

If you need to parse XML, want finer control over parsing, or want to feed Lambda Soup something other than bytes, see Parsing signals.

select selector node is all the descendants of node matching CSS selector selector. All CSS3 selectors are supported, except those which imply layout or a user interface:

:link, :visited, :hover, :active, :focus, :target, :lang, :enabled,
:disabled, :checked, :indeterminate, ::first-line, ::first-letter,
::selection, ::before, ::after

XML namespace selectors are not supported. Lambda Soup supports the canceled :contains("foo") pseudo-class.

In regular CSS, a selector cannot start with a combinator such as >. Lambda Soup allows selectors such as > p, + p, and ~ p, which select immediate children of node, adjacent next siblings, and all next siblings, respectively.

In addition, you can use the empty selector to select node itself. In this case, note that if node is not an element (for example, it is often the soup node), select will result in nothing: select always results in sequences of element nodes only.

Like select, but evaluates to at most one element. Note that there is also R.select_one if you don't want an optional result, which is explained at require.

node $$ selector is the same as select selector node.

node $? selector is the same as select_one selector node.

node $ selector is the same as select_one selector node |> require.

Open Soup.Infix instead of Soup to introduce only the infix operators $$, $?, and $ into your scope.
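
As a rough illustration of how the three operators differ, the sketch below runs them against a small made-up document. The HTML string and all binding names are assumptions chosen for this example.

```ocaml
open Soup

(* Hypothetical sample document, used only for illustration. *)
let soup =
  parse "<ul id=\"menu\"><li>one</li><li>two</li></ul><p>three</p>"

(* $$ evaluates to a lazy sequence of every matching element. *)
let item_count = soup $$ "#menu li" |> count

(* $? evaluates to an option, so a missing element is not an error. *)
let has_table =
  match soup $? "table" with
  | Some _ -> true
  | None -> false

(* $ raises Failure when nothing matches; use it when absence is a bug. *)
let paragraph_text = soup $ "p" |> R.leaf_text

(* A selector may begin with a combinator: "> li" selects only the
   immediate li children of the list. *)
let direct_items = (soup $ "ul") $$ "> li" |> count
```

Choosing $? over $ is mostly a question of whether a missing element is an expected case in your application.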

Options

require (Some v) evaluates to v, and require None raises Failure.
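
A minimal sketch of require applied to an optional attribute lookup. The document and binding names are invented for the example; R.attribute is the same function used in the usage examples at the top of this page.

```ocaml
open Soup

(* Hypothetical one-link document. *)
let link = parse "<a href=\"/home\">Home</a>" $ "a"

(* attribute evaluates to an option; require unwraps it. *)
let href = attribute "href" link |> require

(* The R variant performs the same unwrapping in one step. *)
let href' = R.attribute "href" link

(* On None, require raises Failure, which can be caught as usual. *)
let missing_title =
  try Some (attribute "title" link |> require)
  with Failure _ -> None
```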

Early termination

Used for early termination. See with_stop below.

with_stop (fun stop -> e) behaves as e. However, if the evaluation of e calls stop.throw v, the whole expression immediately evaluates to v instead.

For example, here is an expression that finds the first node with a draggable attribute, stopping traversal immediately when that occurs:

with_stop (fun stop ->
  some_root_node
  |> descendants
  |> elements
  |> iter (fun element ->
    if has_attribute "draggable" element then
      stop.throw (Some element));
  None)

Element access

The element's tag name. For example, an <a> element has tag name a. All tag names are converted to lowercase.

attribute attr element retrieves the value of attribute attr from the given element.

The element's class list. For example, <a class="foo bar"> has class list ["foo"; "bar"].

has_attribute attr element indicates whether element has attribute attr.

fold_attributes f init element applies f successively to the names and values of the attributes of element. The first string argument to f is the attribute name, and the second is the value.

Given any node, asserts that it is an element e. If so, evaluates to Some e. Otherwise, evaluates to None.

Filters non-elements from a sequence of nodes.

Indicates whether the given node is an element.
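
The element accessors above can be tried on a single element, as in the following sketch. The markup and binding names are assumptions made for illustration.

```ocaml
open Soup

(* Made-up element; the tag is written in uppercase on purpose. *)
let link =
  parse "<A class=\"foo bar\" href=\"/docs\">Docs</A>" $ "a"

(* Tag names are converted to lowercase. *)
let tag_name = name link

let link_classes = classes link

(* attribute evaluates to an option; a missing attribute is None. *)
let target = attribute "href" link
let title = attribute "title" link

(* Collect attribute names. The result is sorted because this sketch
   does not assume any particular fold order. *)
let attribute_names =
  fold_attributes (fun acc attr _value -> attr :: acc) [] link
  |> List.sort compare

(* The first child of the link is a text node, not an element. *)
let first_child_is_element =
  link |> children |> first |> require |> is_element
```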

Content access

texts node is the content of all text nodes that are descendants of node. If node is itself a text node, evaluates to node's content.

Same as texts, but all strings are passed through String.trim, and then all empty strings are filtered out.
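
To make the difference between texts and trimmed_texts concrete, here is a small sketch on an invented paragraph with padding around its text.

```ocaml
open Soup

(* Hypothetical paragraph with leading and trailing spaces. *)
let paragraph = parse "<p> Hello, <b>world</b>! </p>" $ "p"

(* texts keeps every text node as it is, including whitespace. *)
let joined = texts paragraph |> String.concat ""

(* trimmed_texts trims each string and drops the empty ones. *)
let trimmed = trimmed_texts paragraph
```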

leaf_text node retrieves the content of one text node in node:

If node is a text node itself, with value s, leaf_text node evaluates to Some s.
If node is an element or soup node, leaf_text node first filters out all text children of node that contain only whitespace. If exactly one child remains, it evaluates to leaf_text applied to that child. If no children remain, it evaluates to Some "". If two or more children remain, it evaluates to None.

Here are some examples of what leaf_text produces for various nodes:

some text                                =>   Some "some text"
<p>some text</p>                         =>   Some "some text"
<div><p>some text</p></div>              =>   Some "some text"
<div> <p>some text</p></div>             =>   Some "some text"
<div><p>some text</p><p>more</p></div>   =>   None
<div></div>                              =>   Some ""

Elementary traversals

children node is the sequence of all children of node. If node is a text node, the traversal is empty.

descendants node is the sequence of all descendants of node. node is not considered to be its own descendant. If node is a text node, the traversal is empty.

ancestors node is the sequence of all ancestors of node. node is not considered to be its own ancestor. The soup node is not included. Ancestors are ordered by proximity to node, i.e. the sequence goes up the DOM tree to a root element.

siblings node is the sequence of all siblings of node. node is not considered to be its own sibling. The siblings are ordered as they appear in the child list of node's parent.

Like siblings, but only those siblings which follow node in its parent's child list.

Like siblings, but only those siblings which precede node in its parent's child list, and ordered by proximity to node, i.e. the reverse order of appearance in node's parent's child list.
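
The ordering rules for the sibling traversals are easiest to see on a concrete tree. The sketch below uses an invented document and a helper, element_names, that exists only in this example.

```ocaml
open Soup

(* Hypothetical document: one container with four element children. *)
let soup =
  parse "<div><h1>a</h1><h2>b</h2><p id=\"mid\">c</p><span>d</span></div>"

let mid = soup $ "#mid"

(* Helper for this example only: tag names of the elements in a traversal. *)
let element_names nodes = nodes |> elements |> to_list |> List.map name

(* Siblings after the node, in document order. *)
let following = element_names (next_siblings mid)

(* Siblings before the node, nearest first. *)
let preceding = element_names (previous_siblings mid)

(* All siblings, in the order of the parent's child list. *)
let all_siblings = element_names (siblings mid)

(* ancestors starts with the closest ancestor. *)
let nearest_ancestor = mid |> ancestors |> first |> require |> name

let child_count = soup $ "div" |> children |> count
```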

Combinators

fold f init s folds f over the nodes of s, i.e. if s is n, n', n'', ..., evaluates f (f (f init n) n') n'' ....

filter f s is the sequence consisting of the nodes n of s for which f n evaluates to true.

map f s is the sequence consisting of nodes f n for each node n of s.

filter_map f s is the sequence consisting of nodes n' for each node n of s for which f n evaluates to Some n'. Nodes for which f n evaluates to None are dropped.

flatten f s is the sequence consisting of the concatenation of all the sequences f n for each n in s.

iter f s applies f to each node in s.

Reverses the given node sequence. Note that this forces traversal of the sequence.

Converts the given node sequence to a list.
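
A short sketch of the combinators applied to the result of a selector. The document is made up, and the binding names are illustrative.

```ocaml
open Soup

(* Hypothetical document with three links, one without an href. *)
let soup = parse "<a href=\"/a\">x</a><a>y</a><a href=\"/b\">z</a>"

(* fold accumulates over the sequence; here it collects href values. *)
let hrefs =
  soup $$ "a"
  |> fold (fun acc a ->
       match attribute "href" a with
       | Some href -> href :: acc
       | None -> acc) []
  |> List.rev

(* filter keeps only the nodes satisfying the predicate. *)
let linked_count = soup $$ "a" |> filter (has_attribute "href") |> count

(* to_list leaves the lazy sequence and gives an ordinary OCaml list. *)
let link_texts = soup $$ "a" |> to_list |> List.map R.leaf_text
```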

Projection

nth n s evaluates to the nth member of s, if it is present. The index is 1-based. This is for consistency with the CSS :nth-child selectors.

Evaluates to the first node of the given sequence.

Evaluates the entire given sequence and returns the last node.

Evaluates the entire given sequence and then returns the number of nodes.

Evaluates to the index of the given node in its parent's child list. If the node has no parent, the index is 1. The index is 1-based, according to CSS convention.

Evaluates to the element index of the given element in the parent's child list. That is, the index of the given element when the parent's non-element children are disregarded. The index is 1-based, according to CSS convention.
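
The following sketch shows the 1-based indexing and the difference between the two index functions on an invented paragraph whose first child is a text node.

```ocaml
open Soup

(* Hypothetical paragraph: a text node followed by two elements. *)
let soup = parse "<p>lead <b>one</b><i>two</i></p>"

let italic = soup $ "i"

(* Position among all children of p, counting the leading text node. *)
let node_index = index_of italic

(* Position among the element children only. *)
let element_index = index_of_element italic

(* nth is 1-based, so 2 picks the second element child. *)
let second_name = soup $$ "p > *" |> nth 2 |> require |> name

let first_name = soup $$ "p > *" |> first |> require |> name
let last_name = soup $$ "p > *" |> last |> require |> name
```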

Convenience

Evaluates to all descendant elements of the given node that have the given tag name. For example, some_root_node |> tags "a" is a sequence of all a elements under some_root_node. It is equivalent to

some_root_node
|> descendants |> elements |> filter (fun e -> name e = "a")

and

some_root_node $$ "a"

Tag names are case-insensitive.

Like tags, but evaluates to only the first element.

Given a node, evaluates to its parent element, if it has one. Note that root nodes do not have a parent element, as their parent is the soup node. Equivalent to n |> ancestors |> first.

Indicates whether the given node is not a soup node, and either has no parent, or its parent is a soup node.

child node evaluates to node's first child. Equivalent to node |> children |> first.

child_element node evaluates to node's first child element. Equivalent to node |> children |> elements |> first.

next_sibling node is the next sibling of node in node's parent's child list. Equivalent to node |> next_siblings |> first.

Like next_sibling, but for the preceding sibling instead.

next_element node is the next sibling of node that is an element. Equivalent to node |> next_siblings |> elements |> first.

Like next_element, but for the preceding siblings instead.

Indicates whether the given node has no child nodes.

Indicates whether the given node has at most one child node.
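
A few of the convenience functions side by side, as a sketch on a made-up document. Each one could be replaced by the longer traversal it is documented as equivalent to.

```ocaml
open Soup

(* Hypothetical document. *)
let soup = parse "<div><a href=\"#\">x</a><p></p></div>"

let anchor = soup $ "a"

(* tags and a plain tag selector find the same elements. *)
let by_tag = soup |> tags "a" |> count
let by_selector = soup $$ "a" |> count

let parent_name = anchor |> parent |> require |> name
let next_name = anchor |> next_element |> require |> name
let first_child_name = soup $ "div" |> child_element |> require |> name

(* The empty paragraph has no child nodes at all. *)
let paragraph_is_empty = soup $ "p" |> no_children

(* The link has exactly one child, its text node. *)
let anchor_is_small = at_most_one_child anchor
```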

Printing

Converts the node tree rooted at the given node to an HTML5 string, preserving whitespace nodes.

Converts the node tree rooted at the given node to an HTML5 string formatted for easy reading by humans, difference algorithms, etc.

Note that this can change the whitespace structure of the HTML, so it may display differently in a browser than the original parsed document.
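
A minimal sketch of the two printing functions on an invented element. The exact layout produced by pretty_print is not shown here, since this example makes no assumption about it beyond what is stated above.

```ocaml
open Soup

(* Hypothetical element to serialize. *)
let paragraph = parse "<p>Hello <b>world</b></p>" $ "p"

(* to_string writes the tree back out without touching whitespace. *)
let compact = to_string paragraph

(* pretty_print may add line breaks and indentation. *)
let readable = pretty_print paragraph
```

Use to_string when the output must render the same way as the input, and pretty_print when the output is meant to be read or diffed.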

Parsing signals

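Lambda Soup uses Markup.ml internally to parse and write markup. If you need to parse or write XML, want finer control over parsing, or want to feed Lambda Soup something other than bytes, use the functions below in place of parse and to_string.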

Converts the node tree rooted at the given node to a stream of Markup.ml signals. This underlies to_string and pretty_print.

For example, you can use this function together with Markup.write_xml to output XML, instead of HTML:

soup |> signals |> Markup.write_xml |> Markup.to_string

Converts a stream of Markup.ml signals to a Lambda Soup document. This underlies parse.

For example, you can use this function together with Markup.parse_xml to load XML into Lambda Soup:

Markup.string s |> Markup.parse_xml |> Markup.signals |> from_signals

Namespaces are ignored at the moment.

Equality

equal n n' recursively tests the node trees rooted at n and n' for equality. To test true, the trees must be identical, including whitespace text nodes. Class attributes and other multi-valued attributes are compared literally: classes must be listed in the same order, with the same amount of whitespace in the attribute value. For the purposes of comparison, adjacent text nodes are merged, and empty text nodes are ignored: this is the standard HTML normalization procedure.

equal_modulo_whitespace n n' is like equal n n', but all text nodes have their values passed through String.trim. Text nodes that become empty are then ignored for the purpose of comparison, as in equal.
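
The comparison rules can be checked with a few small documents. The sketch below is illustrative only; all four HTML strings are made up.

```ocaml
open Soup

let original = parse "<p class=\"x y\">hi</p>"
let same = parse "<p class=\"x y\">hi</p>"
let padded = parse "<p class=\"x y\"> hi </p>"
let reordered = parse "<p class=\"y x\">hi</p>"

(* Identical trees compare equal. *)
let identical = equal original same

(* Extra whitespace in a text node matters to equal... *)
let strict_padded = equal original padded

(* ...but not to equal_modulo_whitespace. *)
let loose_padded = equal_modulo_whitespace original padded

(* Class attributes are compared literally, so order matters. *)
let strict_reordered = equal original reordered
```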

Mutation

create_element tag creates a new element with the name tag.

If ~attributes is specified, the given attributes are added to the element. ~attributes defaults to [].

If ~classes is specified, the class names are concatenated into a single string s and the class attribute is set on the element to the resulting value. This takes precedence over ~attributes.

If ~class_ is specified, the class is set on the element. This takes precedence over both ~attributes and ~classes.

If ~id is specified, the id is set. This takes precedence over ~attributes.

If ~inner_text is specified, a text node is created with the given string, and made the single child of the new element.
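
As a sketch of how the optional arguments combine, the following builds a link from scratch and reads the result back. The attribute values are invented for the example.

```ocaml
open Soup

(* Build <a id="home" class="nav active" href="/home">Home</a>. *)
let link =
  create_element
    ~id:"home"
    ~classes:["nav"; "active"]
    ~attributes:["href", "/home"]
    ~inner_text:"Home"
    "a"

let link_name = name link
let link_id = attribute "id" link
let link_classes = classes link
let link_href = attribute "href" link
let link_text = leaf_text link
```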

Creates a new text node with the given content.

Creates a new empty document node.

append_child element node adds node to the end of the child list of element.

prepend_child element node adds node to the beginning of the child list of element.

insert_at_index k element node makes node the kth child of element. Note that the index is 1-based. If k is outside the range of current valid indices, node is inserted at the beginning or end of element's child list.

insert_before node node' inserts node' immediately before node in node's parent's child list.

insert_after node node' inserts node' immediately after node in node's parent's child list.

Deletes the given node by unlinking it from its parent. If the node has descendants, they are implicitly deleted by this operation as well, in the sense that they become unreachable from the parent.

Unlinks all children of the given node.

replace node node' replaces node with node' in node's parent's child list. All descendants of node are implicitly deleted by this operation, because they become unreachable from node's parent.

swap element element' replaces element with element' in element's parent's child list. All children of element are transferred to element', and all original children of element' are transferred to element.

wrap node element inserts element in the place of node, and then makes node element's child. All original children of element are unlinked.

unwrap node unlinks node, and inserts all of node's children as children of node's parent at the former location of node.

append_root soup node adds node as the last root node of soup.
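
The tree-editing functions above mutate nodes in place. The sketch below strings several of them together on a made-up document; item_texts is a helper defined only for this example.

```ocaml
open Soup

(* Hypothetical document: a list inside a container. *)
let soup = parse "<div id=\"c\"><ul><li>one</li></ul></div>"
let ul = soup $ "ul"

(* Helper for this example only: the text of each list item, in order. *)
let item_texts () = ul $$ "li" |> to_list |> List.map R.leaf_text

(* Add one item at each end of the list. *)
let () =
  append_child ul (create_element ~inner_text:"two" "li");
  prepend_child ul (create_element ~inner_text:"zero" "li")

let after_insert = item_texts ()

(* wrap puts a new section where the list was, and the list inside it. *)
let () = wrap ul (create_element "section")
let wrapped = to_string (soup $ "section")

(* unwrap removes the section again and leaves the list in its place. *)
let () = unwrap (soup $ "section")
let parent_after_unwrap = ul |> parent |> require |> attribute "id"

(* delete unlinks the first li, together with its text. *)
let () = delete (soup $ "li")
let after_delete = item_texts ()
```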

Sets the tag name of the given element.

set_attribute attr value element sets the value of attribute attr on element to value. If the attribute is not present, it is added to element. If it is already present, the value is replaced.

Removes the given attribute from the given element. If the attribute is not present, has no effect.

add_class c element adds class c to element, if element does not already have class c.

remove_class c element removes class c from element, if element has class c.
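
A brief sketch of attribute and class mutation on a freshly created element. The values are invented, and the class list is sorted before comparison because this example does not assume where add_class places a new class.

```ocaml
open Soup

(* Hypothetical element to modify. *)
let el = create_element ~attributes:["href", "/old"] "a"

let () =
  (* Replaces the existing href value. *)
  set_attribute "href" "/new" el;
  (* Adds an attribute that was not present. *)
  set_attribute "title" "Home" el;
  add_class "x" el;
  add_class "y" el;
  (* Adding a class the element already has changes nothing. *)
  add_class "x" el

let href = attribute "href" el
let title = attribute "title" el
let classes_after_add = classes el |> List.sort compare

let () = remove_class "x" el
let classes_after_remove = classes el
```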

I/O

Lambda Soup is not an I/O library. However, it provides a few simple helpers based on standard I/O functions in Pervasives. These should not be used for "serious" code. They are only for when you need to get something done quickly, and/or don't care about corner cases, concurrency, or excellent reliability. In such cases, they allow you to avoid writing I/O wrappers or using additional libraries.

Reads the entire contents of the file with the given path. Raises Sys_error on failure.

Reads all bytes from the given channel.

write_file path data writes data to the file given by path. If the file already exists, it is truncated (erased). If you want to append to a file, use open_out_gen with the necessary flags, and pass the resulting channel to write_channel. Raises Sys_error on failure.
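
For quick scripts, the helpers can be combined as in this sketch, which writes a string to a temporary file and reads it back. The function name round_trip and the file name prefix are assumptions for the example; Filename.temp_file and Sys.remove come from the OCaml standard library.

```ocaml
open Soup

(* Example-only helper: write html to a temporary file, read it back,
   and clean up. Sys_error from either helper propagates to the caller. *)
let round_trip html =
  let path = Filename.temp_file "lambdasoup" ".html" in
  write_file path html;
  let contents = read_file path in
  Sys.remove path;
  contents

let saved = round_trip "<p>Saved page</p>"

(* The string read back can be parsed like any other. *)
let text = parse saved $ "p" |> R.leaf_text
```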