module Soup : sig .. end
Lambda Soup is an HTML data extraction and analysis library. It supports CSS
selectors, DOM traversals, mutation, and HTML output. This very documentation
page was generated by ocamldoc and then
rewritten by Lambda Soup!
Here are some usage examples:
open Soup
let soup = read_channel stdin |> parse in
(* Print the page title. *)
soup $ "title" |> R.leaf_text |> print_endline;
(* Print the targets of all links. *)
soup $$ "a[href]"
|> iter (fun a -> print_endline (R.attribute "href" a));
(* Find the first unordered list. *)
let ul = soup $ "ul" in
(* Print the contents of all its items. *)
ul $$ "li"
|> iter (fun li ->
  trimmed_texts li |> String.concat "" |> print_endline);
(* Find all subsequent sibling elements of the same list. *)
let _ = ul $$ "~ *" in
(* Find all previous sibling elements instead. *)
let _ = ul |> previous_siblings |> elements in
(* ... *)
Lambda Soup is based around two kinds of values: nodes, which represent
HTML elements, text content, and so on, and traversals, which are lazy
sequences of nodes. The top-level node is the soup node (a.k.a. document
node), which you typically get by calling parse on a string containing
HTML.
Once you have a node, you call select on it to traverse to other nodes
using CSS. There are also specialized functions, such as ancestors and
previous_siblings, which allow you to traverse in directions that CSS
cannot express.
Traversals can be manipulated with familiar combinators such as map,
fold, and filter. They can also be terminated early.
Once you have traversed to a node you are interested in, you can extract its content or attributes, mutate it, cause other side effects, begin another traversal, or do anything else your application requires. Enjoy!
Lambda Soup is developed on GitHub and distributed under the BSD license.
This documentation page is for version 0.6.1 of the library. Documentation
for other versions can be downloaded from the
releases page.
type element
type general
type soup
Phantom types for use with 'a node. See explanation below.
type 'a node
Represents HTML nodes. An element node represents a node
that is known to be an element, soup node represents an entire document,
and general node represents a node that might be anything, including an
element, a document, or text. There is no phantom type specifically for text
nodes.
Throughout Lambda Soup, if a function can operate on any kind of node, the
argument is typed at 'a node. If an element node or the entire document is
required, the argument type is element node or soup node,
respectively. general node is the result of a function that can't
guarantee that it evaluates to only elements or only documents.
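As a rough illustration of how the phantom types show up in practice, the sketch below annotates a few values with their node types. The markup and the value names (p, kids, child_names) are invented for this example; only the Soup functions come from this page.

```ocaml
open Soup

(* A small document to work with; the markup is made up for this sketch. *)
let soup : soup node = parse "<p id=\"intro\">Hello, <b>world</b>!</p>"

(* ($) accepts any kind of node and yields an element node. *)
let p : element node = soup $ "p"

(* children cannot promise elements only, so it yields general nodes. *)
let kids : general node list = p |> children |> to_list

(* name requires an element node, so general nodes are narrowed first. *)
let child_names : string list =
  p |> children |> elements |> to_list |> List.map name
```

Passing one of the general nodes in kids directly to name would be rejected by the type checker, which is the point of the phantom parameter.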
type 'a nodes
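Sequence of nodes. This is always instantiated as either element nodes or
general nodes. The sequence is lazy in the sense that only as many
elements as needed are evaluated. This can be used with with_stop to
traverse part of a document until some condition is met.
val parse : string -> soup node
Parses a string containing HTML and produces a soup (document) node.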
If you need to parse XML, want finer control over parsing, or want to feed
Lambda Soup something other than bytes, see Parsing
signals.
val select : string -> 'a node -> element nodes
select selector node is all the descendants of node matching CSS
selector selector. All CSS3 selectors are
supported, except those which imply layout or a user interface:
:link, :visited, :hover, :active, :focus, :target, :lang, :enabled,
:disabled, :checked, :indeterminate, ::first-line, ::first-letter,
::selection, ::before, ::after
XML namespace selectors are not supported. Lambda Soup supports the canceled
:contains("foo") pseudo-class.
In regular CSS, a selector cannot start with a combinator such as >.
Lambda Soup allows selectors such as > p, + p, and ~ p, which select
immediate children of node, adjacent next siblings, and all next siblings,
respectively.
In addition, you can use the empty selector to select node itself. In this
case, note that if node is not an element (for example, it is often the
soup node), select will result in nothing: select always results in
sequences of element nodes only.
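The selector extensions described above might be exercised as follows. This is a minimal sketch on invented markup; the value names are hypothetical, and the selectors are the ones this page says are supported.

```ocaml
open Soup

(* Made-up markup for this sketch. *)
let soup = parse "<div><ul><li>one</li><li>two</li></ul><p>after</p></div>"
let ul = soup $ "ul"

(* "> li" selects only the immediate li children of ul. *)
let items = ul $$ "> li" |> to_list |> List.map R.leaf_text

(* "+ p" selects a p element that is the adjacent next sibling of ul. *)
let adjacent = ul $$ "+ p" |> to_list |> List.map name

(* "~ *" selects every element sibling that follows ul. *)
let following = ul $$ "~ *" |> to_list |> List.map name

(* The :contains pseudo-class matches on text content. *)
let matching = soup $$ "li:contains(\"two\")" |> count
```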
val select_one : string -> 'a node -> element node option
Like select, but evaluates to at most one element. Note that there is also
R.select_one if you don't want an optional result, which is explained at
require.
val ($$) : 'a node -> string -> element nodes
node $$ selector is the same as select selector node.
val ($?) : 'a node -> string -> element node option
node $? selector is the same as select_one selector node.
val ($) : 'a node -> string -> element node
node $ selector is the same as select_one selector node |> require.
module Infix : sig .. end
Open Soup.Infix instead of Soup to introduce only the infix operators
$$, $?, and $ into your scope.
val require : 'a option -> 'a
require (Some v) evaluates to v, and require None raises Failure.
module R : sig .. end
For each function Soup.f that evaluates to an option, Soup.R.f is a
version of f that is post-composed with require, so, for example, you
can write soup |> children |> R.first instead of
soup |> children |> first |> require.
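To make the difference between the optional and the requiring variants concrete, here is a small sketch. The markup and value names are assumptions made for illustration; the behavior of require is as described above.

```ocaml
open Soup

(* Made-up markup for this sketch. *)
let soup = parse "<div><h1>Title</h1><p>Body</p></div>"

(* ($?) evaluates to an option, so a missing element is an ordinary None. *)
let has_table =
  match soup $? "table" with
  | Some _ -> true
  | None -> false

(* ($) is ($?) followed by require, so it yields the element directly. *)
let heading = soup $ "h1" |> name

(* R.first is first post-composed with require. *)
let first_child_text = soup $ "div" |> children |> R.first |> R.leaf_text

(* require None raises Failure, which can be handled like any exception. *)
let recovered =
  try soup $ "table" |> name with Failure _ -> "no table"
```

The optional forms suit markup that may legitimately lack an element, while the R forms keep pipelines short when absence would be a bug anyway.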
type 'a stop = {
  throw : 'b. 'a -> 'b;
}
Used for early termination. See with_stop below.
val with_stop : ('a stop -> 'a) -> 'a
with_stop (fun stop -> e) behaves as e. However, if the evaluation of
e calls stop.throw v, the whole expression immediately evaluates to v
instead.
For example, here is an expression that finds the first node with a
draggable attribute, stopping traversal immediately when that occurs:
with_stop (fun stop ->
  some_root_node
  |> descendants
  |> elements
  |> iter (fun element ->
    if has_attribute "draggable" element then
      stop.throw (Some element));
  None)
val name : element node -> string
The element's tag name. For example, an <a> element has tag name a. All
tag names are converted to lowercase.
val attribute : string -> element node -> string option
attribute attr element retrieves the value of attribute attr from the
given element.
val classes : element node -> string list
The element's class list. For example, <a class="foo bar"> has class list
["foo"; "bar"].
val id : element node -> string option
val has_attribute : string -> element node -> bool
has_attribute attr element indicates whether element has attribute
attr.
val fold_attributes : ('a -> string -> string -> 'a) -> 'a -> element node -> 'a
fold_attributes f init element applies f successively to the names and
values of the attributes of element. The first string argument to f is
the attribute name, and the second is the value.
val element : 'a node -> element node option
Checks whether the given node is an element e. If so, evaluates to
Some e. Otherwise, evaluates to None.
val elements : 'a nodes -> element nodes
val is_element : 'a node -> bool
val texts : 'a node -> string list
texts node is the content of all text nodes that are descendants of
node. If node is itself a text node, evaluates to node's content.
val trimmed_texts : 'a node -> string list
Like texts, but all strings are passed through String.trim, and then
all empty strings are filtered out.
val leaf_text : 'a node -> string option
leaf_text node retrieves the content of one text node in node:
If node is a text node itself, with value s, leaf_text node
evaluates to Some s.
If node is an element or soup node, then leaf_text node filters out
all text children of node containing only whitespace. If there is only
one child child remaining, it evaluates to leaf_text child. If there
are no children remaining, it evaluates to Some "". If there are two or
more children remaining, it evaluates to None.
Here are some examples of what leaf_text produces for various nodes:
some text => Some "some text"
<p>some text</p> => Some "some text"
<div><p>some text</p></div> => Some "some text"
<div> <p>some text</p></div> => Some "some text"
<div><p>some text</p><p>more</p></div> => None
<div></div> => Some ""
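The three text extraction functions can be compared side by side. The following is a sketch on invented markup, with hypothetical value names; it assumes the rules for texts, trimmed_texts, and leaf_text stated above.

```ocaml
open Soup

(* Made-up markup: a div with two paragraphs separated by whitespace. *)
let soup = parse "<div> <p>some text</p> <p>more</p> </div>"
let div = soup $ "div"

(* texts keeps every descendant text node, whitespace included. *)
let all_texts = texts div

(* trimmed_texts trims each string and drops the ones that become empty. *)
let trimmed = trimmed_texts div

(* Two non-whitespace children remain under div, so there is no leaf text. *)
let div_leaf = leaf_text div

(* A single paragraph has exactly one text child. *)
let p_leaf = leaf_text (div $ "p")
```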
val children : 'a node -> general nodes
children node is the sequence of all children of node. If node is a
text node, the traversal is empty.
val descendants : 'a node -> general nodes
descendants node is the sequence of all descendants of node. node is
not considered to be its own descendant. If node is a text node, the
traversal is empty.
val ancestors : 'a node -> element nodes
ancestors node is the sequence of all ancestors of node. node is not
considered to be its own ancestor. The soup node is not included. Ancestors
are ordered by proximity to node, i.e. the sequence goes up the DOM tree
to a root element.
val siblings : 'a node -> general nodes
siblings node is the sequence of all siblings of node. node is not
considered to be its own sibling. The siblings are ordered as they appear in
the child list of node's parent.
val next_siblings : 'a node -> general nodes
val previous_siblings : 'a node -> general nodes
Like siblings, but only those siblings which precede node in its
parent's child list, and ordered by proximity to node, i.e. the reverse
order of appearance in node's parent's child list.
val fold : ('a -> 'b node -> 'a) -> 'a -> 'b nodes -> 'a
fold f init s folds f over the nodes of s, i.e. if s is
n, n', n'', ..., evaluates f (f (f init n) n') n'' ....
val filter : ('a node -> bool) -> 'a nodes -> 'a nodes
filter f s is the sequence consisting of the nodes n of s for which
f n evaluates to true.
val map : ('a node -> 'b node) -> 'a nodes -> 'b nodes
map f s is the sequence consisting of nodes f n for each node n of
s.
val filter_map : ('a node -> 'b node option) -> 'a nodes -> 'b nodes
filter_map f s is the sequence consisting of nodes n' for each node n
of s for which f n evaluates to Some n'. Nodes for which f n
evaluates to None are dropped.
val flatten : ('a node -> 'b nodes) -> 'a nodes -> 'b nodes
flatten f s is the sequence consisting of the concatenation of all the
sequences f n for each n in s.
val iter : ('a node -> unit) -> 'a nodes -> unit
iter f s applies f to each node in s.
val rev : 'a nodes -> 'a nodes
val to_list : 'a nodes -> 'a node list
val nth : int -> 'a nodes -> 'a node option
nth n s evaluates to the nth member of s, if it is present. The index
is 1-based. This is for consistency with the CSS :nth-child selectors.
val first : 'a nodes -> 'a node option
val last : 'a nodes -> 'a node option
val count : 'a nodes -> int
val index_of : 'a node -> int
val index_of_element : element node -> int
val tags : string -> 'a node -> element nodes
For example, some_root_node |> tags "a" is a sequence of all a
elements under some_root_node. It is equivalent to
some_root_node
|> descendants |> elements |> filter (fun e -> name e = "a")
and
some_root_node $$ "a"
Tag names are case-insensitive.
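The traversal combinators above compose in the usual way. Here is a minimal sketch on an invented list; the helper items and the other value names are hypothetical.

```ocaml
open Soup

(* Made-up markup for this sketch. *)
let soup = parse "<ol><li>alpha</li><li>beta</li><li>gamma</li></ol>"

(* A fresh traversal over the list items each time it is called. *)
let items () = soup $$ "li"

(* fold visits nodes in order, accumulating a result. *)
let total_length =
  items () |> fold (fun acc li -> acc + String.length (R.leaf_text li)) 0

(* filter keeps only the nodes satisfying the predicate. *)
let long_items =
  items ()
  |> filter (fun li -> String.length (R.leaf_text li) > 4)
  |> to_list
  |> List.map R.leaf_text

(* nth is 1-based, like :nth-child, so index 1 is the first item. *)
let first_item = items () |> nth 1 |> require |> R.leaf_text

(* tags selects descendant elements by tag name. *)
let li_count = soup |> tags "li" |> count
```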
val tag : string -> 'a node -> element node option
Like tags, but evaluates to only the first element.
val parent : 'a node -> element node option
parent n is equivalent to n |> ancestors |> first.
val is_root : 'a node -> bool
val child : 'a node -> general node option
child node evaluates to node's first child. Equivalent to
node |> children |> first.
val child_element : 'a node -> element node option
child_element node evaluates to node's first child element. Equivalent
to node |> children |> elements |> first.
val next_sibling : 'a node -> general node option
next_sibling node is the next sibling of node in node's parent's child
list. Equivalent to node |> next_siblings |> first.
val previous_sibling : 'a node -> general node option
Like next_sibling, but for the preceding sibling instead.
val next_element : 'a node -> element node option
next_element node is the next sibling of node that is an element.
Equivalent to node |> next_siblings |> elements |> first.
val previous_element : 'a node -> element node option
Like next_element, but for the preceding siblings instead.
val no_children : 'a node -> bool
val at_most_one_child : 'a node -> bool
val to_string : 'a node -> string
val pretty_print : 'a node -> string
Note that this can change the whitespace structure of the HTML, so it may
display differently in a browser than the original parsed document.
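The single-node shortcuts listed above (parent, child_element, next_element, previous_element) can be sketched as follows. The markup and value names are invented for illustration.

```ocaml
open Soup

(* Made-up markup for this sketch. *)
let soup = parse "<ul><li>one</li><li>two</li><li>three</li></ul>"
let second = soup $ "li:nth-child(2)"

(* parent is the nearest ancestor element. *)
let parent_name = second |> parent |> require |> name

(* next_element and previous_element consider only element siblings. *)
let next_text = second |> next_element |> require |> R.leaf_text
let previous_text = second |> previous_element |> require |> R.leaf_text

(* child_element is the first child that is an element. *)
let first_item = soup $ "ul" |> child_element |> require |> R.leaf_text
```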
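Lambda Soup uses Markup.ml internally to parse and write markup. If you wish to
go beyond what parse and to_string offer, for example to parse or write XML,
the functions below work directly with Markup.ml signal streams.
See the Markup.ml documentation for a description of the types involved. The
Markup.ml overview may be a good place to start.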
val signals : 'a node -> (Markup.signal, Markup.sync) Markup.stream
Converts the node tree rooted at the given node to a stream of Markup.ml
signals. This underlies to_string and pretty_print.
For example, you can use this function together with
Markup.write_xml to
output XML, instead of HTML:
soup |> signals |> Markup.write_xml |> Markup.to_string
val from_signals : (Markup.signal, Markup.sync) Markup.stream -> soup node
Converts a stream of Markup.ml signals to a Lambda Soup document. This
underlies parse.
For example, you can use this function together with
Markup.parse_xml to
load XML into Lambda Soup:
Markup.string s |> Markup.parse_xml |> Markup.signals |> from_signals
Namespaces are ignored at the moment.
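Putting the two one-liners above together gives an XML round trip. This is a sketch only: the XML string and value names are invented, and it assumes the Markup.ml library is linked alongside Lambda Soup.

```ocaml
open Soup

(* Made-up XML input for this sketch. *)
let xml = "<root><item>one</item><item>two</item></root>"

(* Load XML through Markup.ml rather than through parse. *)
let soup =
  Markup.string xml |> Markup.parse_xml |> Markup.signals |> from_signals

(* Once loaded, the usual selectors and text helpers apply. *)
let item_texts = soup $$ "item" |> to_list |> List.map R.leaf_text

(* Serialize back out as XML instead of HTML. *)
let output = soup |> signals |> Markup.write_xml |> Markup.to_string
```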
val equal : 'a node -> 'b node -> bool
equal n n' recursively tests the node trees rooted at n and n' for
equality. To test true, the trees must be identical, including whitespace
text nodes. Class attributes and other multi-valued attributes are compared
literally: classes must be listed in the same order, with the same amount of
whitespace in the attribute value. For the purposes of comparison, adjacent
text nodes are merged, and empty text nodes are ignored: this is the
standard HTML normalization procedure.
val equal_modulo_whitespace : 'a node -> 'b node -> bool
equal_modulo_whitespace n n' is like equal n n', but all text nodes have
their values passed through String.trim. Text nodes that become empty are
then ignored for the purpose of comparison, as in equal.
val create_element : ?id:string ->
  ?class_:string ->
  ?classes:string list ->
  ?attributes:(string * string) list ->
  ?inner_text:string -> string -> element node
create_element tag creates a new element with the name tag.
If ~attributes is specified, the given attributes are added to the
element. ~attributes defaults to [].
If ~classes is specified, the class names are concatenated into a single
string s and the class attribute is set on the element to the resulting
value. This takes precedence over ~attributes.
If ~class_ is specified, the class is set on the element. This takes
precedence over both ~attributes and ~classes.
If ~id is specified, the id is set. This takes precedence over
~attributes.
If ~inner_text is specified, a text node is created with the given string,
and made the single child of the new element.
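A call that uses several of the optional arguments might look like this. It is a sketch under the rules stated above; the attribute values and names such as link are made up.

```ocaml
open Soup

(* Build <a> with an id, two classes, an href, and a single text child. *)
let link =
  create_element
    ~id:"home"
    ~classes:["nav"; "active"]
    ~attributes:["href", "/index.html"]
    ~inner_text:"Home"
    "a"

(* Read the pieces back with the accessor functions. *)
let link_id = id link
let link_classes = classes link
let link_href = attribute "href" link
let link_text = leaf_text link
```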
val create_text : string -> general node
val create_soup : unit -> soup node
val append_child : element node -> 'a node -> unit
append_child element node adds node to the end of the child list of
element.
val prepend_child : element node -> 'a node -> unit
prepend_child element node adds node to the beginning of the child list
of element.
val insert_at_index : int -> element node -> 'a node -> unit
insert_at_index k element node makes node the kth child of element.
Note that the index is 1-based. If k is outside the range of current valid
indices, node is inserted at the beginning or end of element's child
list.
val insert_before : 'a node -> 'b node -> unit
insert_before node node' inserts node' immediately before node in
node's parent's child list.
val insert_after : 'a node -> 'b node -> unit
insert_after node node' inserts node' immediately after node in
node's parent's child list.
val delete : 'a node -> unit
val clear : 'a node -> unit
val replace : 'a node -> 'b node -> unit
replace node node' replaces node with node' in node's parent's child
list. All descendants of node are implicitly deleted by this operation,
because they become unreachable from node's parent.
val swap : element node -> element node -> unit
swap element element' replaces element with element' in element's
parent's child list. All children of element are transferred to
element', and all original children of element' are transferred to
element.
val wrap : 'a node -> element node -> unit
wrap node element inserts element in the place of node, and then makes
node element's child. All original children of element are
unlinked.
val unwrap : 'a node -> unit
unwrap node unlinks node, and inserts all of node's children as
children of node's parent at the former location of node.
val append_root : soup node -> 'a node -> unit
append_root soup node adds node as the last root node of soup.
val set_name : string -> element node -> unit
val set_attribute : string -> string -> element node -> unit
set_attribute attr value element sets the value of attribute attr on
element to value. If the attribute is not present, it is added to
element. If it is already present, the value is replaced.
val delete_attribute : string -> element node -> unit
val add_class : string -> element node -> unit
add_class c element adds class c to element, if element does not
already have class c.
val remove_class : string -> element node -> unit
remove_class c element removes class c from element, if element has
class c.
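A few of the mutation functions can be combined as in the sketch below. The markup, attribute values, and value names are invented; the calls follow the argument orders given above.

```ocaml
open Soup

(* Made-up markup for this sketch. *)
let soup = parse "<ul id=\"list\"><li>first</li></ul>"
let ul = soup $ "ul"

let () =
  (* Add a new item at the end of the list. *)
  append_child ul (create_element ~inner_text:"last" "li");
  (* Add another one at the front. *)
  prepend_child ul (create_element ~inner_text:"zeroth" "li");
  (* Replace the existing id attribute. *)
  set_attribute "id" "items" ul;
  (* Adding the same class twice leaves a single copy. *)
  add_class "compact" ul;
  add_class "compact" ul

(* Inspect the result of the mutations. *)
let item_texts = ul $$ "li" |> to_list |> List.map R.leaf_text
let ul_id = attribute "id" ul
let ul_classes = classes ul
```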
Lambda Soup is not an I/O library. However, it provides a few simple helpers
based on standard I/O functions in
Pervasives. These should not be used for "serious" code. They are only
for when you need to get something done quickly, and/or don't care about
corner cases, concurrency, or excellent reliability. In such cases, they
allow you to avoid writing I/O wrappers or using additional libraries.
Using these, you can write little command-line scrapers and filters:
let () =
  let soup = read_channel stdin |> parse in
  (* ...do things to soup... *)
  soup |> to_string |> write_channel stdout
If the above is compiled to a file scrape, you can then run
curl -L "http://location.com" | ./scrape
val read_file : string -> string
Reads the entire contents of the file with the given path. Raises
Sys_error on failure.
val read_channel : Pervasives.in_channel -> string
val write_file : string -> string -> unit
write_file path data writes data to the file given by path. If the
file already exists, it is truncated (erased). If you want to append to
a file, use
open_out_gen with the necessary flags, and pass the resulting channel to
write_channel. Raises Sys_error on failure.
val write_channel : Pervasives.out_channel -> string -> unit
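The append case mentioned for write_file could be handled along these lines. The helper append_file is hypothetical and not part of Lambda Soup; the flags and the 0o644 permission are assumptions chosen for this sketch, and Fun.protect requires OCaml 4.08 or later.

```ocaml
open Soup

(* Hypothetical helper: append data to the file at path, creating it if needed. *)
let append_file path data =
  let channel = open_out_gen [Open_append; Open_creat] 0o644 path in
  (* Close the channel even if writing raises. *)
  Fun.protect
    ~finally:(fun () -> close_out channel)
    (fun () -> write_channel channel data)
```

As with the other I/O helpers, this is meant for quick scripts rather than code that must handle every failure mode.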