module Soup : sig .. end
Lambda Soup is an HTML data extraction and analysis library. It supports CSS
selectors, DOM traversals, mutation, and HTML output. This very documentation
page was generated by ocamldoc and then rewritten by Lambda Soup!
Here are some usage examples:
  open Soup

  let () =
    let soup = read_channel stdin |> parse in
    (* Print the page title. *)
    soup $ "title" |> R.leaf_text |> print_endline;
    (* Print the targets of all links. *)
    soup $$ "a[href]"
    |> iter (fun a -> print_endline (R.attribute "href" a));
    (* Find the first unordered list. *)
    let ul = soup $ "ul" in
    (* Print the contents of all its items. *)
    ul $$ "li"
    |> iter (fun li ->
      trimmed_texts li |> String.concat "" |> print_endline);
    (* Find all subsequent sibling elements of the same list. *)
    let _ = ul $$ "~ *" in
    (* Find all previous sibling elements instead. *)
    let _ = ul |> previous_siblings |> elements in
    (* ... *)
    ()
Lambda Soup is based around two kinds of values: nodes, which represent
HTML elements, text content, and so on, and traversals, which are lazy
sequences of nodes. The top-level node is the soup node (a.k.a. document
node), which you typically get by calling parse on a string containing HTML.
Once you have a node, you call select on it to traverse to other nodes
using CSS. There are also specialized functions, such as ancestors and
previous_siblings, which allow you to traverse in directions that CSS
cannot express.
Traversals can be manipulated with familiar combinators such as map, fold,
and filter. They can also be terminated early.
Once you have traversed to a node you are interested in, you can extract its content or attributes, mutate it, cause other side effects, begin another traversal, or do anything else your application requires. Enjoy!
Lambda Soup is developed on GitHub and distributed under the BSD license.
This documentation page is for version 0.6.1 of the library. Documentation
for other versions can be downloaded from the
releases page.
type element
type general
type soup
Phantom types for use with 'a node. See explanation below.
type 'a node
element node represents a node that is known to be an element, soup node
represents an entire document, and general node represents a node that might
be anything, including an element, a document, or text. There is no phantom
type specifically for text nodes.
Throughout Lambda Soup, if a function can operate on any kind of node, the
argument is typed at 'a node. If an element node or the entire document is
required, the argument type is element node or soup node, respectively.
general node is the result of a function that can't guarantee that it
evaluates to only elements or only documents.
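As a rough illustration of how these phantom types show up in ordinary code, here is a minimal sketch. The helper names describe and link_target are hypothetical and are not part of Lambda Soup; only parse, element, name, attribute, and $ come from the library.

```ocaml
open Soup

(* Hypothetical helper: accepts any kind of node, so the argument is typed
   at 'a node. *)
let describe (node : 'a node) : string =
  match element node with
  | Some e -> "element <" ^ name e ^ ">"
  | None -> "not an element"

(* Hypothetical helper: attributes only exist on elements, so the argument
   is typed at element node. *)
let link_target (a : element node) : string option =
  attribute "href" a

let soup : soup node = parse "<a href=\"/home\">Home</a>"
let anchor : element node = soup $ "a"

let () =
  print_endline (describe soup);
  print_endline (describe anchor);
  match link_target anchor with
  | Some href -> print_endline href
  | None -> print_endline "(no href)"
```

Passing soup to link_target would be rejected by the type checker, which is the point of the phantom parameter.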
type 'a nodes
A sequence of nodes, instantiated as either element nodes or general nodes.
The sequence is lazy in the sense that only as many elements as needed are
evaluated. This can be used with with_stop to traverse part of a document
until some condition is met.
val parse : string -> soup node
If you need to parse XML, want finer control over parsing, or want to feed
Lambda Soup something other than bytes, see Parsing signals.
val select : string -> 'a node -> element nodes
select selector node is all the descendants of node matching CSS selector
selector. All CSS3 selectors are supported, except those which imply layout
or a user interface:
  :link, :visited, :hover, :active, :focus, :target, :lang, :enabled,
  :disabled, :checked, :indeterminate, ::first-line, ::first-letter,
  ::selection, ::before, ::after
XML namespace selectors are not supported. Lambda Soup also supports the
:contains("foo") pseudo-class, which was dropped from CSS3.
In regular CSS, a selector cannot start with a combinator such as >.
Lambda Soup allows selectors such as > p, + p, and ~ p, which select
immediate children of node, adjacent next siblings, and all next siblings,
respectively.
In addition, you can use the empty selector to select node itself. In this
case, note that if node is not an element (for example, it is often the
soup node), select will result in nothing: select always results in
sequences of element nodes only.
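The selector extensions described above can be sketched as follows. The HTML string and all binding names are made up for illustration; the selectors themselves are the ones this section documents.

```ocaml
open Soup

let soup =
  parse
    "<ul><li>one</li><li class=\"x\">two</li><li>three</li></ul><p>after</p>"

let ul = soup $ "ul"
let second = soup $ "li.x"

(* "> li" selects the immediate li children of ul. *)
let child_count = ul $$ "> li" |> count

(* "+ li" selects the adjacent next sibling of the starting node. *)
let adjacent = second $$ "+ li" |> to_list |> List.map R.leaf_text

(* "~ *" selects all next sibling elements of the starting node. *)
let following = ul $$ "~ *" |> to_list |> List.map name

(* The :contains pseudo-class matches on text content. *)
let with_text = soup $$ "li:contains(\"two\")" |> count

(* The empty selector selects the node itself, but only if it is an
   element. The soup node is not an element, so nothing is selected. *)
let self_of_element = ul $$ "" |> count
let self_of_soup = soup $$ "" |> count
```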
val select_one : string -> 'a node -> element node option
Like select, but evaluates to at most one element. Note that there is also
R.select_one if you don't want an optional result, which is explained at
require.
val ($$) : 'a node -> string -> element nodes
node $$ selector is the same as select selector node.
val ($?) : 'a node -> string -> element node option
node $? selector is the same as select_one selector node.
val ($) : 'a node -> string -> element node
node $ selector is the same as select_one selector node |> require.
module Infix : sig .. end
Open Soup.Infix instead of Soup to introduce only the infix operators $$, $?,
and $ into your scope.
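A short sketch of the three operators side by side, with only Soup.Infix opened so that every other function stays qualified. The document and binding names are illustrative.

```ocaml
(* Only $$, $?, and $ are brought into scope. *)
open Soup.Infix

let soup = Soup.parse "<div id=\"main\"><p>first</p><p>second</p></div>"

(* $$ yields a sequence of all matching elements. *)
let paragraph_count = soup $$ "p" |> Soup.count

(* $? yields an option, so a missing element is an ordinary value. *)
let has_table =
  match soup $? "table" with
  | Some _ -> true
  | None -> false

(* $ yields the element directly and raises Failure if nothing matches. *)
let first_text = soup $ "p" |> Soup.R.leaf_text

let missing_raises =
  try ignore (soup $ "table"); false
  with Failure _ -> true
```

Choosing between $? and $ is mostly a question of whether a missing element is an expected case or a bug in the caller's assumptions about the page.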
val require : 'a option -> 'a
require (Some v) evaluates to v, and require None raises Failure.
module R : sig .. end
For each function Soup.f that evaluates to an option, Soup.R.f is a version
of f that is post-composed with require, so, for example, you can write
soup |> children |> R.first instead of soup |> children |> first |> require.
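To make the relationship between the option-returning functions and their R counterparts concrete, here is a small sketch. The sample document and binding names are assumptions for illustration.

```ocaml
open Soup

let soup = parse "<ul><li>a</li><li>b</li></ul>"
let ul = soup $ "ul"

(* Option-returning version: the caller handles the empty case. *)
let first_opt : general node option = ul |> children |> first

(* R version: the same traversal, post-composed with require. *)
let first_item : general node = ul |> children |> R.first

(* The two agree whenever a result exists. *)
let same =
  match first_opt with
  | Some n -> equal n first_item
  | None -> false

(* R.attribute raises Failure where attribute would evaluate to None. *)
let href_missing =
  try ignore (R.attribute "href" ul); false
  with Failure _ -> true
```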
type 'a stop = {
  throw : 'b. 'a -> 'b;
}
See with_stop below.
val with_stop : ('a stop -> 'a) -> 'a
with_stop (fun stop -> e) behaves as e. However, if the evaluation of e calls
stop.throw v, the whole expression immediately evaluates to v instead.
For example, here is an expression that finds the first node with a draggable
attribute, stopping traversal immediately when that occurs:
  with_stop (fun stop ->
    some_root_node
    |> descendants
    |> elements
    |> iter (fun element ->
      if has_attribute "draggable" element then
        stop.throw (Some element));
    None)
val name : element node -> string
For example, an <a> element has tag name a. All tag names are converted to
lowercase.
val attribute : string -> element node -> string option
attribute attr element retrieves the value of attribute attr from the given
element.
val classes : element node -> string list
For example, <a class="foo bar"> has class list ["foo"; "bar"].
val id : element node -> string option
val has_attribute : string -> element node -> bool
has_attribute attr element indicates whether element has attribute attr.
val fold_attributes : ('a -> string -> string -> 'a) -> 'a -> element node -> 'a
fold_attributes f init element applies f successively to the names and values
of the attributes of element. The first string argument to f is the attribute
name, and the second is the value.
val element : 'a node -> element node option
Checks whether the given node is an element e. If so, evaluates to Some e.
Otherwise, evaluates to None.
val elements : 'a nodes -> element nodes
val is_element : 'a node -> bool
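The attribute accessors and element conversions above might be used together as in the following sketch. The markup and binding names are illustrative; attribute names are sorted before comparison because this page does not state the order in which fold_attributes visits them.

```ocaml
open Soup

let a = parse "<a id=\"top\" class=\"nav external\" href=\"/x\">link</a>" $ "a"

let tag_name = name a                  (* "a" *)
let href = attribute "href" a          (* Some "/x" *)
let missing = attribute "title" a      (* None *)
let class_list = classes a
let ident = id a
let has_href = has_attribute "href" a

(* f receives the accumulator, then the attribute name, then its value. *)
let attribute_names =
  fold_attributes (fun acc attr_name _value -> attr_name :: acc) [] a
  |> List.sort compare

(* children yields general nodes; elements keeps only the elements. *)
let div = parse "<div>text<b>bold</b></div>" $ "div"
let child_total = div |> children |> count
let child_elements = div |> children |> elements |> count
```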
val texts : 'a node -> string list
texts node is the content of all text nodes that are descendants of node. If
node is itself a text node, evaluates to node's content.
val trimmed_texts : 'a node -> string list
Like texts, but all strings are passed through String.trim, and then all
empty strings are filtered out.
val leaf_text : 'a node -> string option
leaf_text node retrieves the content of one text node in node:
If node is a text node itself, with value s, leaf_text node evaluates to
Some s.
If node is an element or soup node, then leaf_text node filters out all text
children of node containing only whitespace. If there is only one child child
remaining, it evaluates to leaf_text child. If there are no children
remaining, it evaluates to Some "". If there are two or more children
remaining, it evaluates to None.
Here are some examples of what leaf_text produces for various nodes:
  some text                               =>   Some "some text"
  <p>some text</p>                        =>   Some "some text"
  <div><p>some text</p></div>             =>   Some "some text"
  <div> <p>some text</p></div>            =>   Some "some text"
  <div><p>some text</p><p>more</p></div>  =>   None
  <div></div>                             =>   Some ""
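The difference between the three text extraction functions can be seen on a single node. This is a sketch on a made-up fragment; the exact contents of all_texts depend on how the parser preserves whitespace, so no value is shown for it.

```ocaml
open Soup

let div = parse "<div> <p>some text</p> <p> more </p></div>" $ "div"

(* All descendant text nodes, whitespace included. *)
let all_texts = texts div

(* The same, trimmed, with empty strings dropped. *)
let trimmed = trimmed_texts div

(* Two non-whitespace children remain, so there is no single leaf. *)
let ambiguous = leaf_text div

(* The first p has exactly one text child. *)
let single = leaf_text (div $ "p")
```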
val children : 'a node -> general nodes
children node is the sequence of all children of node. If node is a text
node, the traversal is empty.
val descendants : 'a node -> general nodes
descendants node is the sequence of all descendants of node. node is not
considered to be its own descendant. If node is a text node, the traversal is
empty.
val ancestors : 'a node -> element nodes
ancestors node is the sequence of all ancestors of node. node is not
considered to be its own ancestor. The soup node is not included. Ancestors
are ordered by proximity to node, i.e. the sequence goes up the DOM tree to a
root element.
val siblings : 'a node -> general nodes
siblings node is the sequence of all siblings of node. node is not considered
to be its own sibling. The siblings are ordered as they appear in the child
list of node's parent.
val next_siblings : 'a node -> general nodes
val previous_siblings : 'a node -> general nodes
Like siblings, but only those siblings which precede node in its parent's
child list, and ordered by proximity to node, i.e. the reverse order of
appearance in node's parent's child list.
val fold : ('a -> 'b node -> 'a) -> 'a -> 'b nodes -> 'a
fold f init s folds f over the nodes of s, i.e. if s is n, n', n'', ...,
evaluates f (f (f init n) n') n'' ....
val filter : ('a node -> bool) -> 'a nodes -> 'a nodes
filter f s is the sequence consisting of the nodes n of s for which f n
evaluates to true.
val map : ('a node -> 'b node) -> 'a nodes -> 'b nodes
map f s is the sequence consisting of nodes f n for each node n of s.
val filter_map : ('a node -> 'b node option) -> 'a nodes -> 'b nodes
filter_map f s is the sequence consisting of nodes n' for each node n of s
for which f n evaluates to Some n'. Nodes for which f n evaluates to None are
dropped.
val flatten : ('a node -> 'b nodes) -> 'a nodes -> 'b nodes
flatten f s is the sequence consisting of the concatenation of all the
sequences f n for each n in s.
val iter : ('a node -> unit) -> 'a nodes -> unit
iter f s applies f to each node in s.
val rev : 'a nodes -> 'a nodes
val to_list : 'a nodes -> 'a node list
val nth : int -> 'a nodes -> 'a node option
nth n s evaluates to the nth member of s, if it is present. The index is
1-based. This is for consistency with the CSS :nth-child selectors.
val first : 'a nodes -> 'a node option
val last : 'a nodes -> 'a node option
val count : 'a nodes -> int
val index_of : 'a node -> int
val index_of_element : element node -> int
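The traversal functions and combinators above compose in the usual pipeline style. The following sketch uses a made-up list; the binding names are illustrative, and the R functions are the require-composed versions described earlier.

```ocaml
open Soup

let soup = parse "<ol><li>a</li><li>b</li><li>c</li><li>d</li></ol>"
let ol = soup $ "ol"
let items = ol |> children |> elements

(* nth is 1-based, like :nth-child. *)
let second = items |> R.nth 2 |> R.leaf_text

(* previous_siblings is ordered by proximity, i.e. nearest sibling first. *)
let third = items |> R.nth 3
let before_third =
  third |> previous_siblings |> elements |> to_list |> List.map R.leaf_text

(* fold and filter behave like their list counterparts. *)
let joined = fold (fun acc li -> acc ^ R.leaf_text li) "" items
let not_b = items |> filter (fun li -> R.leaf_text li <> "b") |> count

(* filter_map with element has the same effect as elements here. *)
let element_children = ol |> children |> filter_map element |> count

let last_label = items |> R.last |> R.leaf_text

(* ancestors goes upward from the node, nearest ancestor first. *)
let nearest_ancestor = third |> ancestors |> R.first |> name
```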
val tags : string -> 'a node -> element nodes
some_root_node |> tags "a" is a sequence of all a elements under
some_root_node. It is equivalent to
  some_root_node
  |> descendants |> elements |> filter (fun e -> name e = "a")
and
  some_root_node $$ "a"
Tag names are case-insensitive.
val tag : string -> 'a node -> element node option
Like tags, but evaluates to only the first element.
val parent : 'a node -> element node option
Equivalent to n |> ancestors |> first.
val is_root : 'a node -> bool
val child : 'a node -> general node option
child node evaluates to node's first child. Equivalent to
node |> children |> first.
val child_element : 'a node -> element node option
child_element node evaluates to node's first child element. Equivalent to
node |> children |> elements |> first.
val next_sibling : 'a node -> general node option
next_sibling node is the next sibling of node in node's parent's child list.
Equivalent to node |> next_siblings |> first.
val previous_sibling : 'a node -> general node option
Like next_sibling, but for the preceding sibling instead.
val next_element : 'a node -> element node option
next_element node is the next sibling of node that is an element. Equivalent
to node |> next_siblings |> elements |> first.
val previous_element : 'a node -> element node option
Like next_element, but for the preceding siblings instead.
val no_children : 'a node -> bool
val at_most_one_child : 'a node -> bool
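The single-step navigation helpers above can be sketched on a small fragment. The markup and binding names are made up; the sketch assumes no_children and at_most_one_child test what their names say, since this page gives no description for them.

```ocaml
open Soup

let soup = parse "<div><h1>Title</h1>intro<p>body</p><hr></div>"
let h1 = soup $ "h1"

let parent_name = h1 |> R.parent |> name

(* next_sibling may be any node; here it is the text node "intro". *)
let next_is_element = h1 |> R.next_sibling |> is_element

(* next_element skips over non-element siblings. *)
let next_element_name = h1 |> R.next_element |> name

let first_child_element = soup $ "div" |> R.child_element |> name

(* Assumption: these predicates count children as their names suggest. *)
let hr_is_empty = soup $ "hr" |> no_children
let h1_single_child = h1 |> at_most_one_child
```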
val to_string : 'a node -> string
val pretty_print : 'a node -> string
Note that this can change the whitespace structure of the HTML, so it may
display differently in a browser than the original parsed document.
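A minimal sketch of the two serializers on one element. The fragment is made up. The exact layout produced by pretty_print is not specified on this page, so the sketch only compares text content after reparsing rather than showing the pretty-printed string.

```ocaml
open Soup

let p = parse "<p>Hello <b>world</b></p>" $ "p"

(* Compact serialization of the element and its descendants. *)
let compact = to_string p

(* Indented serialization; whitespace may differ from the input. *)
let pretty = pretty_print p

(* Reparsing the pretty-printed output and trimming the texts recovers the
   original words, even if the whitespace around them has changed. *)
let words = parse pretty |> trimmed_texts

let () = print_string pretty
```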
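Lambda Soup uses Markup.ml internally to parse and write markup. If you wish
to work with Markup.ml directly, for example to parse or write XML or to
control parsing more finely, use the functions below instead of parse and
to_string.
See the Markup.ml documentation for a description of the types involved. The
Markup.ml overview may be a good place to start.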
val signals : 'a node -> (Markup.signal, Markup.sync) Markup.stream
Converts the tree rooted at the given node to a stream of Markup.ml signals.
This underlies to_string and pretty_print.
For example, you can use this function together with Markup.write_xml to
output XML, instead of HTML:
  soup |> signals |> Markup.write_xml |> Markup.to_string
val from_signals : (Markup.signal, Markup.sync) Markup.stream -> soup node
Converts a stream of Markup.ml signals to a Lambda Soup document. This
underlies parse.
For example, you can use this function together with Markup.parse_xml to load
XML into Lambda Soup:
  Markup.string s |> Markup.parse_xml |> Markup.signals |> from_signals
Namespaces are ignored at the moment.
val equal : 'a node -> 'b node -> bool
equal n n' recursively tests the node trees rooted at n and n' for equality.
For the result to be true, the trees must be identical, including whitespace
text nodes. Class attributes and other multi-valued attributes are compared
literally: classes must be listed in the same order, with the same amount of
whitespace in the attribute value. For the purposes of comparison, adjacent
text nodes are merged, and empty text nodes are ignored: this is the standard
HTML normalization procedure.
val equal_modulo_whitespace : 'a node -> 'b node -> bool
equal_modulo_whitespace n n' is like equal n n', but all text nodes have
their values passed through String.trim. Text nodes that become empty are
then ignored for the purpose of comparison, as in equal.
val create_element :
  ?id:string ->
  ?class_:string ->
  ?classes:string list ->
  ?attributes:(string * string) list ->
  ?inner_text:string -> string -> element node
create_element tag creates a new element with the name tag.
If ~attributes is specified, the given attributes are added to the element.
~attributes defaults to [].
If ~classes is specified, the class names are concatenated into a single
string and the class attribute is set on the element to the resulting value.
This takes precedence over ~attributes.
If ~class_ is specified, the class is set on the element. This takes
precedence over both ~attributes and ~classes.
If ~id is specified, the id is set. This takes precedence over ~attributes.
If ~inner_text is specified, a text node is created with the given string,
and made the single child of the new element.
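The optional arguments can be combined freely. The following sketch builds two elements and reads the results back with the accessors documented earlier; the tag names and values are illustrative.

```ocaml
open Soup

let link =
  create_element "a"
    ~id:"home"
    ~classes:["nav"; "active"]
    ~attributes:["href", "/"]
    ~inner_text:"Home"

let link_id = id link
let link_classes = classes link
let link_href = attribute "href" link
let link_text = leaf_text link

(* ~class_ sets a single class directly. *)
let note = create_element "p" ~class_:"note"
let note_classes = classes note

(* Without ~inner_text, the new element has no text. *)
let note_text = leaf_text note
```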
val create_text : string -> general node
val create_soup : unit -> soup node
val append_child : element node -> 'a node -> unit
append_child element node adds node to the end of the child list of element.
val prepend_child : element node -> 'a node -> unit
prepend_child element node adds node to the beginning of the child list of
element.
val insert_at_index : int -> element node -> 'a node -> unit
insert_at_index k element node makes node the kth child of element. Note that
the index is 1-based. If k is outside the range of current valid indices,
node is inserted at the beginning or end of element's child list.
val insert_before : 'a node -> 'b node -> unit
insert_before node node' inserts node' immediately before node in node's
parent's child list.
val insert_after : 'a node -> 'b node -> unit
insert_after node node' inserts node' immediately after node in node's
parent's child list.
val delete : 'a node -> unit
val clear : 'a node -> unit
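The insertion functions above can be sketched by growing a list in place. The helper item and the helper labels are hypothetical and exist only for this example. The final step assumes that delete unlinks the given node from its parent, which is what the name suggests but which this page does not spell out.

```ocaml
open Soup

let soup = parse "<ul><li>b</li></ul>"
let ul = soup $ "ul"

(* Hypothetical helper: a fresh li element containing the given text. *)
let item text = create_element ~inner_text:text "li"

(* Hypothetical helper: the current item texts, in document order. *)
let labels () = ul $$ "li" |> to_list |> List.map R.leaf_text

let () =
  append_child ul (item "d");
  prepend_child ul (item "a");
  (* Insert c immediately before the current last child, d. *)
  insert_before (ul |> children |> R.last) (item "c")

let after_inserts = labels ()

(* Assumption: delete removes the first li from the list. *)
let () = delete (ul $ "li")

let after_delete = labels ()
```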
val replace : 'a node -> 'b node -> unit
replace node node' replaces node with node' in node's parent's child list.
All descendants of node are implicitly deleted by this operation, because
they become unreachable from node's parent.
val swap : element node -> element node -> unit
swap element element' replaces element with element' in element's parent's
child list. All children of element are transferred to element', and all
original children of element' are transferred to element.
val wrap : 'a node -> element node -> unit
wrap node element inserts element in the place of node, and then makes node
element's child. All original children of element are unlinked.
val unwrap : 'a node -> unit
unwrap node unlinks node, and inserts all of node's children as children of
node's parent at the former location of node.
val append_root : soup node -> 'a node -> unit
append_root soup node adds node as the last root node of soup.
val set_name : string -> element node -> unit
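The structural mutations documented above (wrap, unwrap, and replace) can be sketched as a sequence of edits on one paragraph, serializing after each step. The fragment and binding names are illustrative.

```ocaml
open Soup

let soup = parse "<p>Hello <b>world</b></p>"
let p = soup $ "p"

(* wrap puts the new i element where b was, then makes b its child. *)
let () = wrap (p $ "b") (create_element "i")
let wrapped = to_string p

(* unwrap removes i again and puts its children back in its place. *)
let () = unwrap (p $ "i")
let unwrapped = to_string p

(* replace swaps b, together with its descendants, for a text node. *)
let () = replace (p $ "b") (create_text "there")
let replaced = to_string p
```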
val set_attribute : string -> string -> element node -> unit
set_attribute attr value element sets the value of attribute attr on element
to value. If the attribute is not present, it is added to element. If it is
already present, the value is replaced.
val delete_attribute : string -> element node -> unit
val add_class : string -> element node -> unit
add_class c element adds class c to element, if element does not already have
class c.
val remove_class : string -> element node -> unit
remove_class c element removes class c from element, if element has class c.
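A short sketch of attribute and class mutation on a single element. The markup is made up. The class list is sorted before comparison because this page does not say where add_class places a new class.

```ocaml
open Soup

let a = parse "<a class=\"nav\" href=\"/old\">x</a>" $ "a"

let () =
  (* Replaces the existing value. *)
  set_attribute "href" "/new" a;
  (* Adds an attribute that was not present. *)
  set_attribute "title" "Home" a;
  add_class "active" a;
  (* No effect: the element already has this class. *)
  add_class "nav" a;
  (* No effect: the element does not have this class. *)
  remove_class "missing" a

let href = attribute "href" a
let title = attribute "title" a
let sorted_classes = classes a |> List.sort compare
```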
Lambda Soup is not an I/O library. However, it provides a few simple helpers
based on standard I/O functions in Pervasives. These should not be used for
"serious" code. They are only for when you need to get something done
quickly, and/or don't care about corner cases, concurrency, or excellent
reliability. In such cases, they allow you to avoid writing I/O wrappers or
using additional libraries.
Using these, you can write little command-line scrapers and filters:
  let () =
    let soup = read_channel stdin |> parse in
    (* ...do things to soup... *)
    soup |> to_string |> write_channel stdout
If the above is compiled to a file scrape, you can then run
  curl -L "http://location.com" | ./scrape
val read_file : string -> string
Raises Sys_error on failure.
val read_channel : Pervasives.in_channel -> string
val write_file : string -> string -> unit
write_file path data writes data to the file given by path. If the file
already exists, it is truncated (erased). If you want to append to the file,
use open_out_gen with the necessary flags, and pass the resulting channel to
write_channel. Raises Sys_error on failure.
val write_channel : Pervasives.out_channel -> string -> unit
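The note about appending can be sketched as follows. The function append_to_file is a hypothetical helper, not part of Lambda Soup, and the flags and the 0o644 permission are one reasonable choice among several. The sketch closes the channel itself, on the assumption that write_channel leaves that to the caller.

```ocaml
open Soup

(* Hypothetical helper: append data to the file at path, creating the file
   if it does not exist. *)
let append_to_file (path : string) (data : string) : unit =
  let channel = open_out_gen [Open_append; Open_creat] 0o644 path in
  (try write_channel channel data
   with e -> close_out_noerr channel; raise e);
  close_out channel

let path = Filename.temp_file "soup_example" ".html"

let () =
  (* write_file truncates, so the file starts with exactly one paragraph. *)
  write_file path "<p>first</p>";
  append_to_file path "<p>second</p>"

let contents = read_file path
let paragraphs = (contents |> parse) $$ "p" |> count

let () = Sys.remove path
```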