File Coverage

blib/lib/Text/TEI/Markup.pm
Criterion Covered Total %
statement 13 15 86.6
branch n/a
condition n/a
subroutine 5 5 100.0
pod n/a
total 18 20 90.0


line stmt bran cond sub pod time code
1             package Text::TEI::Markup;
2              
3 1     1   48448 use strict;
  1         2  
  1         36  
4 1     1   4 use vars qw( $VERSION @EXPORT_OK );
  1         1  
  1         45  
5 1     1   5 use Encode;
  1         2  
  1         80  
6 1     1   4 use Exporter 'import';
  1         1  
  1         20  
7 1     1   1621 use XML::LibXML;
  0            
  0            
8              
9             use utf8;
10              
11             $VERSION = '1.9';
12             @EXPORT_OK = qw( &to_xml &word_tag_wrap );
13              
14             =head1 NAME
15              
16             Text::TEI::Markup - a transcription markup syntax for TEI XML
17              
18             =head1 SYNOPSIS
19              
20             use Text::TEI::Markup qw( to_xml );
21             my $xml_string = to_xml( file => $markup_file,
22             template => $template_xml_string,
23             %opts ); # see below for available options
24              
25             use Text::TEI::Markup qw( word_tag_wrap );
26             my $word_wrapped_xml = word_tag_wrap( $tei_xml_string );
27              
28             =head1 DESCRIPTION
29              
30             TEI XML is a wonderful thing. The elements defined therein allow a
31             transcriber to record and represent just about any feature of a text that
32             he or she encounters.
33              
34             The problem is the transcription itself. When I am transcribing a
35             manuscript, especially if that manuscript is in a bunch of funny characters
36             on the keymap for another language, I do not want to be switching back and
37             forth between keyboard layouts in order to type "<tag attr="arg">
38             arrow-arrow-arrow-arrow-arrow</tag>" every six seconds. It's prone to
39             typo, it's astonishingly slow, and it makes my wrists hurt just to think
40             about it. I also don't really want to fire up an XML editor, select the
41             words or characters that need to be tagged, and click a lot. That way is
42             not prone to typo, but it's still pretty darn slow, and it makes my wrists
43             hurt B<more> to think about.
44              
45             Text::TEI::Markup is my solution to that problem. It defines a bunch of
46             single- or double-character sigils that represent tags. These are a lot
47             faster and easier to type; I don't have to worry about typos; and I can do
48             it all with a plain text editor, thus minimizing use of the mouse.
49              
50             I have tried to pick sigils that don't conflict with characters that are
51             found in manuscripts. I have succeeded for my particular set of
52             manuscripts, but I have not succeeded for the general case. If you like the
53             idea behind this module, you are still almost guaranteed to hate the sigils
54             I've picked. That's okay; you can re-define them.
55              
56             =head2 Extra bonus solution: word wrapping with <w/> and <seg/>
57              
58             Even if you are happy as a clam in the graphical XML editor of your choice,
59             this module exports a function that may be useful to you. The TEI P5
60             guidelines include a module called "analysis", which allows the user to tag
61             sentences, clauses, words, morphemes, or any other sort of semantic segment
62             of a text. This is really good for programmatic applications, but very
63             boring and repetitive to have to tag.
64              
65             The function B<word_tag_wrap> solves part of this problem for you. It takes
66             an XML string as input, looks for words (defined by whitespace separation)
67             and returns an XML string with each of these words wrapped in an
68             appropriate tag. If the word has complex elements (e.g. editorial
69             expansion), it will be wrapped in a <seg type="word"/> tag. If not, it will
70             be in a simple <w/> tag. It handles line breaks and page breaks within
71             words, as long as there is no trailing whitespace before the <lb/> (or
72             <pb/>) tag, and as long as the whitespace after the tag contains a carriage
73             return.
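
The word-wrapping idea described above can be sketched on a plain string. This is only an illustration, not the module's implementation: the helper name wrap_words is hypothetical, the choice of <w> for simple words and <seg type="word"> for words containing markup is an assumption, and the sketch assumes that no tag contains internal whitespace and ignores line and page breaks.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::TEI::Markup.
# Wraps each whitespace-separated word in a tag. Words that already
# contain markup get <seg type="word">, plain words get <w>
# (both tag choices are assumptions made for this sketch).
sub wrap_words {
    my ($text) = @_;
    my @wrapped;
    for my $word ( split ' ', $text ) {
        if ( $word =~ /</ ) {
            push @wrapped, qq{<seg type="word">$word</seg>};
        }
        else {
            push @wrapped, "<w>$word</w>";
        }
    }
    return join ' ', @wrapped;
}

my $wrapped = wrap_words('the b<ex>e</ex>gins here');
print "$wrapped\n";
```

A real implementation would need an XML parser rather than a whitespace split, since attributes inside tags contain spaces.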
74              
75             =head1 MARKUP SYNTAX
76              
77             The input file has a header and a body. The header begins with a '=HEAD'
78             tag, and consists of a colon-separated list of key/value pairs. These keys,
79             which are case insensitive, get directly substituted into an XML template;
80             the idea is that your TEI header won't change very much between files, so
81             you write it once with template values, pass it to &to_xml, and the
82             substitution happens as if by magic. The keyword /MAIN/i is reserved for
83             the content between the <body></body> tags - that is, all the content that
84             will be generated after the '=BODY' tag.
85              
86             A very simple template looks like this:
87              
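88             <?xml version="1.0" encoding="UTF-8"?>
89             <TEI>
90               <teiHeader>
91                 <fileDesc>
92                   <titleStmt>
93                     <title>__TITLE__</title>
94                     <author>__AUTHOR__</author>
95                     <respStmt xml:id="__MYINITIALS__">
96                       <resp>Transcription by</resp>
97                       <name>__MYNAME__</name>
98                     </respStmt>
99                   </titleStmt>
100                 </fileDesc>
101               </teiHeader>
102               <text>
103                 <body>
104             __MAIN__
105                 </body>
106               </text>
107             </TEI>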
108              
109             Your input file should then begin something like this:
110              
111             =HEAD
112             title:My Summer Vacation: a novel
113             author:John Smith
114             myinitials:tla
115             myname:Tara L Andrews
116             =BODY
117             The ^real^ text b\e\gins +(above)t+here.
118             ...
119              
120              
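
The header substitution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code: fill_template is a hypothetical helper, header lines are assumed to be split on the first colon only (so that a value such as "My Summer Vacation: a novel" survives intact), and values are inserted without XML escaping.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::TEI::Markup.
# Parses "key:value" header lines and fills __KEY__ placeholders.
sub fill_template {
    my ( $template, @header_lines ) = @_;
    my %values;
    for my $line (@header_lines) {
        # Split on the first colon only, so values may contain colons.
        my ( $key, $value ) = split /:/, $line, 2;
        next unless defined $value;
        $values{ uc $key } = $value;    # keys are case insensitive
    }
    # Replace known placeholders; leave unknown ones untouched.
    $template =~ s/__([A-Za-z]+)__/ my $k = uc $1; exists $values{$k} ? $values{$k} : '__' . $1 . '__' /ge;
    return $template;
}

my $filled = fill_template(
    '<title>__TITLE__</title><name>__MYNAME__</name>',
    'title:My Summer Vacation: a novel',
    'myname:Tara L Andrews',
);
print "$filled\n";
```

Leaving unknown placeholders in place makes a missing header key easy to spot in the output.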
121             The real work begins after the '=BODY' tag. The currently-defined sigil
122             list is:
123              
124             %SIGILS = (
125             'comment' => '##',
126             'add' => '+',
127             'del' => '-',
128             'subst' => "\x{b1}", # Unicode PLUS-MINUS SIGN
129             'div' => "\x{a7}", # Unicode SECTION SIGN
130             'p' => "\x{b6}", # Unicode PILCROW SIGN
131             'ex' => '\\',
132             'expan' => '^',
133             'supplied' => '@',
134             'abbr' => [ '{', '}' ],
135             'num' => '%',
136             'pb' => [ '[', ']' ],
137             'cb' => '|',
138             'hi' => '*',
139             'unclear' => '?',
140             'q' => "\x{2020}", # Unicode DAGGER
141             );
142              
143             Non-identical matched sets of sigla (e.g. '{}' for abbreviations) should be
144             specified in a listref, as seen here.
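
One way to handle both forms of sigil definition (a single string, or a listref of opening and closing sigils) is to normalize each into an open/close pair. The sketch below is an assumption about how this could be done, not the module's code; sigil_pair and convert_simple are hypothetical names, and the conversion handles only a single unnested tag without arguments.

```perl
use strict;
use warnings;

# Hypothetical helper: return (open, close) for a sigil definition,
# which is either a plain string or a two-element array reference.
sub sigil_pair {
    my ($spec) = @_;
    return ref($spec) eq 'ARRAY' ? @$spec : ( $spec, $spec );
}

# Hypothetical helper: replace one kind of sigil pair with an XML tag.
# Handles neither nesting nor tag arguments.
sub convert_simple {
    my ( $text, $tag, $spec ) = @_;
    # quotemeta keeps characters such as '+', '{' and '^' literal.
    my ( $open, $close ) = map { quotemeta($_) } sigil_pair($spec);
    $text =~ s/$open(.*?)$close/<$tag>$1<\/$tag>/g;
    return $text;
}

print convert_simple( 'a {dns} b', 'abbr',  [ '{', '}' ] ), "\n";
print convert_simple( 'a ^real^ b', 'expan', '^' ),          "\n";
```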
145              
146             Whitespace is only significant at the end of lines. If a line which
147             contains non-tag text (i.e. words) ends in whitespace, it is assumed that
148             the previous word is a complete word. If the line ends with a
149             non-whitespace character, it is assumed that the word continues onto the
150             next line.
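
The end-of-line rule above can be illustrated with a small sketch. It is not the module's implementation: join_lines is a hypothetical helper, input lines are assumed to have had their newline removed already, and the sketch produces plain text rather than the line-break markup a real conversion would need.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::TEI::Markup.
# A line ending in whitespace ends a word; a line ending in a
# non-whitespace character continues its last word on the next line.
sub join_lines {
    my @lines = @_;
    my $text  = '';
    for my $line (@lines) {
        if ( $line =~ /\s$/ ) {
            # Complete word: normalize the trailing whitespace to one space.
            ( my $trimmed = $line ) =~ s/\s+$//;
            $text .= $trimmed . ' ';
        }
        else {
            # Incomplete word: append with no separator.
            $text .= $line;
        }
    }
    $text =~ s/\s+$//;
    return $text;
}

my $joined = join_lines( 'the manu', 'script begins ', 'here' );
print "$joined\n";
```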
151              
152             All the sigils must be balanced, and they must nest properly. Remember that
153             this is a shorthand for XML. I could be convinced to try to autocorrect
154             some unbalanced sigils, but it would be worth at least a few pints of cider
155             (or, of course, a patch.)
156              
157             =head2 Tag arguments
158              
159             Certain of the tags can be passed extra arguments:
160              
161             =over 4
162              
163             =item C<add|del>
164              
165             Anything that appears in parentheses immediately after the add/del opening
166             sigil ( + or - in the examples above) will get added as an attribute. If
167             the string in parentheses has no '=' sign in it, the attribute for the
168             "add" tag will be "place", and the attribute for the "del" tag will be
169             "type". Ergo:
170              
171             +(margin)This is an addition+
172             -(overwrite)and a deletion- to the sentence.
173              
174             will get translated to
175              
176             <add place="margin">This is an addition</add>
177             <del type="overwrite">and a deletion</del> to the sentence.
178              
179             This behavior ought to be more configurable and/or flexible; make it worth
180             my while.
181              
182             =item C<num>
183              
184             A number value can be calculated using a number_conversion function, or it can
185             simply be specified. It is also possible to specify the type of number being
186             represented (B<ord>inal, B<card>inal, B<frac>tion, B<perc>entage). The arguments
187             are separated with a comma, and in the order "value", "type". So for example:
188              
189             The lead was taken by the Exeter %(8)VIII%. This was their
190             %(13,ord)thirteenth% straight win.
191              
192             will become:
193              
194             The lead was taken by the Exeter <num value="8">VIII</num>. This was their
195             <num value="13" type="ordinal">thirteenth</num> straight win.
196              
197             =item C<hi>
198              
199             When text highlighting is encoded, it is almost always a good idea to say
200             something about how the highlight was rendered. This information can be passed
201             as an argument:
202              
203             *(red)IN the beginning* was the word
204            
205             will become
206              
207             <hi rend="red">IN the beginning</hi> was the word
208            
209             =back
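
The default-attribute rule for the add and del tags described above can be sketched like this. It is an illustration only: open_tag is a hypothetical helper, and passing an argument that contains '=' through verbatim as attribute text is an assumption about the intended behavior.

```perl
use strict;
use warnings;

# Default attribute names, as stated in the text above.
my %DEFAULT_ATTR = ( add => 'place', del => 'type' );

# Hypothetical helper, not part of Text::TEI::Markup.
# Builds an opening tag from a tag name and an optional argument.
sub open_tag {
    my ( $tag, $arg ) = @_;
    return "<$tag>" unless defined $arg && length $arg;
    # Assumption: an argument with '=' is already a full attribute.
    return "<$tag $arg>" if $arg =~ /=/;
    return sprintf '<%s %s="%s">', $tag, $DEFAULT_ATTR{$tag}, $arg;
}

print open_tag( 'add', 'margin' ),    "\n";
print open_tag( 'del', 'overwrite' ), "\n";
print open_tag( 'add', 'hand="#b"' ), "\n";
```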
210              
211             =head1 SUBROUTINES
212              
213             =over 4
214              
215             =item B<to_xml>( file => '$filename', %opts );
216              
217             Takes the name of a file that holds a marked-up version of text. Returns a
218             TEI XML string to represent that text. Options include:
219              
220             =over 4
221              
222             =item C