module Markup:sig
..end
Markup.ml is an HTML and XML parsing and serialization library. It:
Markup_lwt
).tree
) allows that to be easily converted into
DOM-style trees.
open Markup
(* Correct and pretty-print HTML. *)
channel stdin
|> parse_html |> signals |> pretty_print
|> write_html |> to_channel stdout
(* Show up to 10 XML well-formedness errors to the user. Stop after
the 10th, without reading more input. *)
let report =
let count = ref 0 in
fun location error ->
error |> Error.to_string ~location |> prerr_endline;
count := !count + 1;
if !count >= 10 then raise_notrace Exit
string "some xml" |> parse_xml ~report |> signals |> drain
(* Load HTML into a custom document tree data type. *)
type html = Text of string | Element of string * html list
file "some_file"
|> fst
|> parse_html
|> signals
|> tree
~text:(fun ss -> Text (String.concat "" ss))
~element:(fun (_, name) _ children -> Element (name, children))
The interface is centered around four functions. In pseudocode:
val parse_html : char stream -> signal stream
val write_html : signal stream -> char stream
val parse_xml : char stream -> signal stream
val write_xml : signal stream -> char stream
Most of the remaining functions create streams from, or write streams to,
strings, files, and channels, or manipulate streams, such as next
and the
combinators map
and fold
.
Apart from this module, Markup.ml provides two other top-level modules:
Markup_lwt |
Lwt interface to Markup.ml.
|
Markup_lwt_unix |
Stream functions based on
Lwt_io .
|
Most of the interface of Markup_lwt
is specified in signature
ASYNCHRONOUS
, which will be shared with a Markup_async
module, should
it be implemented.
Markup.ml is developed on GitHub
and distributed under the
BSD
license. This documentation is for version 0.7.5 of the library.
Documentation for older versions can be found on the
releases page.
type
async
type
sync
('a, 's) stream
in place of 's
. See
explanation below.type ('a, 's)
stream
'a
.
In simple usage, when using only this module Markup
, the additional type
parameter 's
is always sync
, and there is no need to consider it
further.
However, if you are using Markup_lwt
, you may create some async
streams. The difference between the two is that next
on a sync
stream
retrieves an element before next
"returns," while next
on an async
stream might not retrieve an element until later. As a result, it is not
safe to pass an async
stream where a sync
stream is required. The
phantom types are used to make the type checker catch such errors at compile
time.
The parsers recover from errors automatically. If that is sufficient, you
can ignore this section. However, if you want stricter behavior, or need to
debug parser output, use optional argument ?report
of the parsers, and
look in module Error
.
typelocation =
int * int
module Error:sig
..end
to_string
function.
The parsers detect encodings automatically. If you need to specify an
encoding, use optional argument ?encoding
of the parsers, and look in
module Encoding
.
module Encoding:sig
..end
typename =
string * string
type
xml_declaration = {
|
version : |
|
encoding : |
|
standalone : |
<?xml version="1.0" encoding="utf-8"?>
.type
doctype = {
|
doctype_name : |
|
public_identifier : |
|
system_identifier : |
|
raw_text : |
|
force_quirks : |
raw_text
. The XML parser reads declarations roughly, and
fills only the raw_text
field with the text found in the declaration.typesignal =
[ `Comment of string
| `Doctype of doctype
| `End_element
| `PI of string * string
| `Start_element of name * (name * string) list
| `Text of string list
| `Xml of xml_declaration ]
doc ::= `Xml? misc* `Doctype? misc* element misc*
misc ::= `PI | `Comment
element ::= `Start_element content* `End_element
content ::= `Text | element | `PI | `Comment
As a result, emitted `Start_element
and `End_element
signals are always
balanced, and, if there is an XML declaration, it is the first signal.
If parsing with ~context:`Document
, the signal sequence will match the
doc
production until the first error. If parsing with
~context:`Fragment
, it will match content*
. If ~context
is not
specified, the parser will pick one of the two by examining the input.
As an example, if the XML parser is parsing
<?xml version="1.0"?><root>text<nested>more text</nested></root>
it will emit the signal sequence
`Xml {version = "1.0"; encoding = None; standalone = None}
`Start_element (("", "root"), [])
`Text ["text"]
`Start_element (("", "nested"), [])
`Text ["more text"]
`End_element
`End_element
The `Text
signal carries a string list
instead of a single string
because on 32-bit platforms, OCaml strings cannot be larger than 16MB. In
case the parsers encounter a very long sequence of text, one whose length
exceeds about Sys.max_string_length / 2
, they will emit a `Text
signal
with several strings.
typecontent_signal =
[ `End_element
| `Start_element of name * (name * string) list
| `Text of string list ]
signal
to only elements and text, i.e. no comments,
processing instructions, or declarations. This can be useful for pattern
matching in applications that only care about the content and element
structure of a document. See the helper content
.val signal_to_string : [< signal ] -> string
type 's
parser
's parser
is a thin wrapper around a (signal, 's) stream
that
supports access to additional information that is not carried directly in
the stream, such as source locations.val signals : 's parser -> (signal, 's) stream
val location : 'a parser -> location
(1, 1)
.val parse_xml :
?report:(location -> Error.t -> unit) ->
?encoding:Encoding.t ->
?namespace:(string -> string option) ->
?entity:(string -> string option) ->
?context:[< `Document | `Fragment ] ->
(char, 's) stream -> 's parser
For simple usage, string "foo" |> parse_xml |> signals
.
If ~report
is provided, report
is called for every error encountered.
You may raise an exception in report
, and it will propagate to the code
reading the signal stream.
If ~encoding
is not specified, the parser detects the input encoding
automatically. Otherwise, the given encoding is used.
~namespace
is called when the parser is unable to resolve a namespace
prefix. If it evaluates to Some s
, the parser maps the prefix to s
.
Otherwise, the parser reports `Bad_namespace
.
~entity
is called when the parser is unable to resolve an entity
reference. If it evaluates to Some s
, the parser inserts s
into the
text or attribute being parsed without any further parsing of s
. s
is
assumed to be encoded in UTF-8. If entity
evaluates to None
instead,
the parser reports `Bad_token
. See xhtml_entity
if you are parsing
XHTML.
The meaning of ~context
is described at signal
, above.
val write_xml :
?report:(signal * int -> Error.t -> unit) ->
?prefix:(string -> string option) ->
([< signal ], 's) stream -> (char, 's) stream
If ~report
is provided, it is called for every error encountered. The
first argument is a pair of the signal causing the error and its index in
the signal stream. You may raise an exception in report
, and it will
propagate to the code reading the byte stream.
~prefix
is called when the writer is unable to find a prefix in scope
for a namespace URI. If it evaluates to Some s
, the writer uses s
for
the URI. Otherwise, the writer reports `Bad_namespace
.
val parse_html :
?report:(location -> Error.t -> unit) ->
?encoding:Encoding.t ->
?context:[< `Document | `Fragment of string ] ->
(char, 's) stream -> 's parser
parse_xml
, but parses HTML with embedded SVG and MathML, never
emits signals `Xml
or `PI
, and ~context
has a different type on tag
`Fragment
.
For HTML fragments, you should specify the enclosing element, e.g.
`Fragment "body"
. This is because, when parsing HTML, error recovery and
the interpretation of text depend on the current element. For example, the
text
foo</bar>
parses differently in title
elements than in p
elements. In the former,
it is parsed as foo</bar>
, while in the latter, it is foo
followed by a
parse error due to unmatched tag </bar>
. To get these behaviors, set
~context
to `Fragment "title"
and `Fragment "p"
, respectively.
If you use `Fragment "svg"
, the fragment is assumed to be SVG markup.
Likewise, `Fragment "math"
causes the parser to parse MathML markup.
If ~context
is omitted, the parser guesses it from the input stream. For
example, if the first signal would be `Doctype
, the context is set to
`Document
, but if the first signal would be `Start_element "td"
, the
context is set to `Fragment "tr"
. If the first signal would be
`Start_element "g"
, the context is set to `Fragment "svg"
.
val write_html : ([< signal ], 's) stream -> (char, 's) stream
write_xml
, but emits HTML5 instead of XML.val string : string -> (char, sync) stream
val buffer : Buffer.t -> (char, sync) stream
val channel : Pervasives.in_channel -> (char, sync) stream
Sys_error
.
Note that this input source is synchronous because Pervasives.in_channel
reads are blocking. For non-blocking channels, see Markup_lwt_unix
.
val file : string -> (char, sync) stream * (unit -> unit)
file path
opens the file at path
, then evaluates to a pair s, close
,
where reading from stream s
retrieves successive bytes from the file, and
calling close ()
closes the file.
The file is closed automatically if s
is read to completion, or if reading
s
raises an exception. It is not necessary to call close ()
in these
cases.
If the file cannot be opened, raises Sys_error
immediately. If the file
cannot be read, reading the stream raises Sys_error
.
val fn : (unit -> char option) -> (char, sync) stream
fn f
is a stream that retrives bytes by calling f ()
. If the call
results in Some c
, the stream emits c
. If the call results in None
,
the stream is considered to have ended.
This is actually an alias for stream
, restricted to type char
.
val to_string : (char, sync) stream -> string
val to_buffer : (char, sync) stream -> Buffer.t
val to_channel : Pervasives.out_channel -> (char, sync) stream -> unit
Sys_error
.val to_file : string -> (char, sync) stream -> unit
Sys_error
.
Note that the file is truncated (cleared) before writing. If you wish to
append to file, open it with the appropriate flags and use to_channel
on
the resulting channel.val stream : (unit -> 'a option) -> ('a, sync) stream
stream f
creates a stream that repeatedly calls f ()
. Each time f ()
evaluates to Some v
, the next item in the stream is v
. The first time
f ()
evaluates to None
, the stream ends.val next : ('a, sync) stream -> 'a option
val peek : ('a, sync) stream -> 'a option
val transform :
('a -> 'b -> 'c list * 'a option) ->
'a -> ('b, 's) stream -> ('c, 's) stream
transform f init s
lazily creates a stream by repeatedly applying
f acc v
, where acc
is an accumulator whose initial value is init
, and
v
is consecutive values of s
. Each time, f acc v
evaluates to a pair
(vs, maybe_acc')
. The values vs
are added to the result stream. If
maybe_acc'
is Some acc'
, the accumulator is set to acc'
. Otherwise, if
maybe_acc'
is None
, the result stream ends.val fold : ('a -> 'b -> 'a) -> 'a -> ('b, sync) stream -> 'a
fold f init s
eagerly folds over the items v
, v'
, v''
, ... of s
,
i.e. evaluates f (f (f init v) v') v''
...val map : ('a -> 'b) -> ('a, 's) stream -> ('b, 's) stream
map f s
lazily applies f
to each item of s
, and produces the resulting
stream.val filter : ('a -> bool) -> ('a, 's) stream -> ('a, 's) stream
filter f s
is s
without the items for which f
evaluates to false
.
filter
is lazy.val filter_map : ('a -> 'b option) -> ('a, 's) stream -> ('b, 's) stream
filter_map f s
lazily applies f
to each item v
of s
. If f v
evaluates to Some v'
, the result stream has v'
. If f v
evaluates to
None
, no item corresponding to v
appears in the result stream.val iter : ('a -> unit) -> ('a, sync) stream -> unit
iter f s
eagerly applies f
to each item of s
, i.e. evaluates
f v; f v'; f v''
...val drain : ('a, sync) stream -> unit
drain s
eagerly consumes s
. This is useful for observing side effects,
such as parsing errors, when you don't care about the parsing signals
themselves. It is equivalent to iter ignore s
.val of_list : 'a list -> ('a, sync) stream
val to_list : ('a, sync) stream -> 'a list
val content :
([< signal ], 's) stream ->
(content_signal, 's) stream
signal
stream into a content_signal
stream by filtering out
all signals besides `Start_element
, `End_element
, and `Text
.val tree :
?text:(string list -> 'a) ->
?element:(name -> (name * string) list -> 'a list -> 'a) ->
?comment:(string -> 'a) ->
?pi:(string -> string -> 'a) ->
?xml:(xml_declaration -> 'a) ->
?doctype:(doctype -> 'a) ->
([< signal ], sync) stream -> 'a option
type my_dom = Text of string | Element of name * my_dom list
"<p>HTML5 is <em>easy</em> to parse"
|> string
|> parse_html
|> signals
|> tree
~text:(fun ss -> Text (String.concat "" ss))
~element:(fun (name, _) children -> Element (name, children))
results in the structure
Element ("p" [
Text "HTML5 is ";
Element ("em", [Text "easy"]);
Text " to parse"])
Formally, tree
assembles a tree data structure of type 'a
from a signal
stream. The stream is parsed according to the following grammar:
stream ::= node*
node ::= element | `Text | `Comment | `PI | `Xml | `Doctype
element ::= `Start_element node* `End_element
Each time trees
matches a production of node
, it calls the corresponding
function to convert the node into your tree type 'a
. For example, when
trees
matches `Text ss
, it calls ~text ss
, if ~text
is supplied.
Similarly, when trees
matches element
, it calls
~element name attributes children
, if ~element
is supplied.
See trees
if the input stream might have multiple top-level trees. This
function tree
only retrieves the first one.
val trees :
?text:(string list -> 'a) ->
?element:(name -> (name * string) list -> 'a list -> 'a) ->
?comment:(string -> 'a) ->
?pi:(string -> string -> 'a) ->
?xml:(xml_declaration -> 'a) ->
?doctype:(doctype -> 'a) ->
([< signal ], 's) stream -> ('a, 's) stream
tree
, but converts all top-level trees, not only the first one. The
trees are emitted on the resulting stream, in the sequence that they appear
in the input.type'a
node =
[ `Comment of string
| `Doctype of doctype
| `Element of name * (name * string) list * 'a list
| `PI of string * string
| `Text of string
| `Xml of xml_declaration ]
from_tree
below.val from_tree : ('a -> 'a node) -> 'a -> (signal, sync) stream
'a
into signal streams. The
function argument is applied to each data structure node. For example,
type my_dom = Text of string | Element of string * my_dom list
let dom =
Element ("p", [
Text "HTML5 is ";
Element ("em", [Text "easy"]);
Text " to parse"])
dom |> from_tree (function
| Text s -> `Text s
| Element (name, children) -> `Element (("", name), [], children))
results in the signal stream
`Start_element (("", "p"), [])
`Text ["HTML5 is "]
`Start_element (("", "em"), [])
`Text ["easy"]
`End_element
`Text " to parse"
`End_element
val elements :
(name -> (name * string) list -> bool) ->
([< signal ] as 'a, 's) stream ->
(('a, 's) stream, 's) stream
elements f s
scans the signal stream s
for
`Start_element (name, attributes)
signals that satisfy
f name attributes
. Each such matching signal is the beginning of a
substream that ends with the corresponding `End_element
signal. The result
of elements f s
is the stream of these substreams.
Matches don't nest. If there is a matching element contained in another matching element, only the top one results in a substream.
Code using elements
does not have to read each substream to completion, or
at all. However, once the using code has tried to get the next substream, it
should not try to read a previous one.
val text : ([< signal ], 's) stream -> (char, 's) stream
`Text ss
signal, the result stream has the bytes of the strings ss
, and
all other signals are ignored.val trim : ([> `Text of string list ] as 'a, 's) stream -> ('a, 's) stream
`Text ss
, transforms
ss
so that the result strings ss'
satisfy
String.concat "" ss' = String.trim (String.concat "" ss)
All signals for which String.concat "" ss' = ""
are then dropped.
val normalize_text : ([> `Text of string list ] as 'a, 's) stream -> ('a, 's) stream
`Text
signals, then eliminates all empty strings,
then all `Text []
signals. Signals besides `Text
are unaffected. Note
that signal streams emitted by the parsers already have normalized text.
This function is useful when you are inserting text into a signal stream
after parsing, or generating streams from scratch, and would like to clean
up the `Text
signals.val pretty_print :
([> content_signal ] as 'a, 's) stream ->
('a, 's) stream
`Text
signals in the given stream so that
the output appears nicely-indented when the stream is converted to bytes and
written.val html5 : ([< signal ], 's) stream -> (signal, 's) stream
val xhtml :
?dtd:[< `Frameset_1_0 | `Strict_1_0 | `Strict_1_1 | `Transitional_1_0 ] ->
([< signal ], 's) stream -> (signal, 's) stream
html5
, but does not strip processing instructions, and
prefixes an XHTML document type declaration and an XML declaration. The
~dtd
argument specifies which DTD to refer to in the doctype declaration.
The default is `Strict_1_1
.val xhtml_entity : string -> string option
~entity
argument of parse_xml
when parsing XHTML.val strings_to_bytes : (string, 's) stream -> (char, 's) stream
strings_to_bytes s
is the stream of all the bytes of all strings in
s
.val compare_locations : location -> location -> int
module Ns:sig
..end
module type ASYNCHRONOUS =sig
..end
The HTML parser seeks to implement section 8 of the HTML5 specification. That section describes a parser, part of a full-blown user agent, that is building up a DOM representation of an HTML document. Markup.ml is neither inherently part of a user agent, nor does it build up a DOM representation. With respect to section 8 of HTML5, Markup.ml is concerned with only the syntax. When that section requires that the user agent perform an action, Markup.ml emits enough information for a hypothetical user agent based on it to be able to decide to perform this action. Likewise, Markup.ml seeks to emit enough information for a hypothetical user agent to build up a conforming DOM.
The XML parser seeks to be a non-validating implementation of the XML and Namespaces in XML specifications.
This rest of this section lists known deviations from HTML5, XML, and Namespaces in XML. Some of these deviations are meant to be corrected in future versions of Markup.ml, while others will probably remain. The latter satisfy some or all of the following properties:
p
elements and opening of table
elements.form
tags in some
situations.form
and template
.isindex
is completely ignored. isindex
is handled as an unknown element.<meta>
tags for encoding declarations. The user of Markup.ml should read
these, if necessary. They are part of the emitted signal stream.noscript
elements are always parsed, as are script
elements. For
conforming behavior, if the user of Markup.ml "supports scripts," the user
should serialize the content of noscript
to a `Text
signal using
write_html
.title
that belong in head
, but are found
between head
and body
, are not moved into head
.<html>
tags found in the body do not have their attributes added
to the `Start_element "html"
signal emitted at the beginning of the
document.