File Coverage

blib/lib/File/TTX.pm
Criterion Covered Total %
statement 7 9 77.7
branch n/a
condition n/a
subroutine 3 3 100.0
pod n/a
total 10 12 83.3


line stmt bran cond sub pod time code
1             package File::TTX;
2            
3 1     1   22600 use warnings;
  1         2  
  1         30  
4 1     1   4 use strict;
  1         1  
  1         30  
5 1     1   386 use XML::Snap;
  0            
  0            
6             use POSIX qw/strftime/;
7            
8             =head1 NAME
9            
10             File::TTX - Utilities for dealing with TRADOS TTX files
11            
12             =head1 VERSION
13            
14             Version 0.04
15            
16             =cut
17            
18             our $VERSION = '0.04';
19            
20            
21             =head1 SYNOPSIS
22            
23             TRADOS has been more or less the definitive set of translation tools for over a decade; more to the point, they're the
24             tools I use most. There are two basic modes used by TRADOS to interact with documents. The first is in Word documents, which
25             is not addressed in this module. The second is with TagEditor, which has TTX files as its native file format. TTX files are
26             a breed of XML, so they're actually pretty easy to work with.
27            
28             use File::TTX;
29            
30             my $foo = File::TTX->load('myfile.ttx');
31             ... do stuff with it ...
32             $foo->write();
33            
34             Each TTX consists of a header and body text. The header contains various information about the file you can read and write;
35             the text is, well, the text of the document. Before translation, the text consists of just plain text, but as you work TagEditor
36             I the file into segments, each of which is translated in isolation. (The paradigm here is that if you re-encounter a
37             segment or something similar to one you've already done, the translation memory will provide you with the translation, either
38             automatically writing it if it's identical, or at least presenting it to you to speed things up if it's just similar.)
39            
40             A common mode is to read things with a script, build a TTX, and write it out for translation with TagEditor. Here's the kind
41             of functions you'd use for that:
42            
43             use File::TTX;
44            
45             my $ttx = File::TTX->new();
46            
47             $ttx->append_text("This is a sentence.\n");
48             $ttx->append_mark("test mark");
49             $ttx->append_text("\n");
50             $ttx->append_text("This is another sentence.\n");
51            
52             $ttx->write ("my.ttx");
53            
54             After translation, you can use the marks to find out where you are in the file (they'll be skipped during translation without
55             being removed from the file).
56            
57             There are two basic modes for content extraction; either you want to scan all content, or you're just interested in the segments
58             so you can toss them into an Excel spreadsheet or something. These work pretty much the same; to scan all elements, you use
59             C as follows; it returns a list of C elements, documented below, which are really just
60             C elements with a little extra sugar for convenience.
61            
62             use File::TTX;
63             my $ttx = File::TTX->load('myfile.ttx');
64            
65             foreach my $piece ($ttx->content_elements) {
66             if ($piece->type eq 'mark') {
67             # something
68             } else {
69             print $piece->translated . "\n";
70             }
71             }
72            
73             To do a more data-oriented extraction, you'd want the C function, and the loop would look more like this:
74            
75             foreach my $s ($ttx->segments) {
76             print $s->source . " - " . $s->translated . "\n";
77             }
78            
79             Clear? Sure it is.
80            
81             Here's another example: a filter to strip all pre-translated content out of a TTX in case you want a new, un-pre-translated copy.
82            
83             use File::TTX;
84            
85             my $in = $ARGV[0];
86             my $outf = $in;
87             $outf =~ s/\.xls\.ttx$/-stripped.xls.ttx/;
88            
89             my $ttx = File::TTX->load($in);
90             my $out = File::TTX->new(from=>$ttx);
91            
92             foreach my $piece ($ttx->content_elements) {
93             $out->append_copy ($piece->source_xml);
94             }
95            
96             $out->write($outf);
97            
98             It should be easy to see how you can expand that filter idea into nearly anything you need.
99            
100             There are still plenty of gaps in this API! I plan to extend it as I run into new use cases. I'd be overjoyed to hear about yours.
101            
102             =head1 CREATING A TTX OBJECT
103            
104             =head2 new()
105            
106             The C function creates a blank TTX so you can build whatever you want and write it out. If you've already got an XML::Snap
107             structure (that's the library used internally for XML representation here) then you can pass it in and it will be broken down
108             into useful structural components for the element access functions.
109            
110             =cut
111            
112             sub new {
113             my ($class, %input) = @_;
114             my $self = bless {}, $class;
115             if ($input{'xml'}) {
116             $self->{xml} = $input{'xml'};
117             } else {
118             $self->{xml} = XML::Snap->parse ('');
119             }
120             $self->{file} = $input{'file'};
121             $self->{'frontmatter'} = $self->{xml}->first ('FrontMatter');
122             $self->{'toolsettings'} = $self->{frontmatter}->first ('ToolSettings');
123             $self->{'usersettings'} = $self->{frontmatter}->first ('UserSettings');
124             $self->{'body'} = $self->{xml}->first ('Raw');
125            
126             if ($input{'from'}) {
127             $self->copy_header ($input{'from'});
128             return $self;
129             }
130            
131             my $lookup = sub {
132             my ($field, $where, $default) = @_;
133             return $input{$field} if $input{$field};
134             return $self->{$where}->get ($field, $default);
135             };
136            
137             $self->{toolsettings}->set ('CreationTool', $lookup->('CreationTool', 'toolsettings', 'perl with File::TTX'));
138             $self->{toolsettings}->set ('CreationDate', $lookup->('CreationDate', 'toolsettings', $self->date_now));
139             $self->{toolsettings}->set ('CreationToolVersion', $lookup->('CreationToolVersion', 'toolsettings', $VERSION));
140            
141             $self->{usersettings}->set ('SourceDocumentPath', $lookup->('SourceDocumentPath', 'usersettings', ''));
142             $self->{usersettings}->set ('O-Encoding', $lookup->('O-Encoding', 'usersettings', 'windows-1252'));
143             $self->{usersettings}->set ('TargetLanguage', $lookup->('TargetLanguage', 'usersettings', 'EN-US'));
144             $self->{usersettings}->set ('PlugInInfo', $lookup->('PlugInInfo', 'usersettings', ''));
145             $self->{usersettings}->set ('SourceLanguage', $lookup->('SourceLanguage', 'usersettings', 'DE-DE'));
146             $self->{usersettings}->set ('SettingsPath', $lookup->('SettingsPath', 'usersettings', ''));
147             $self->{usersettings}->set ('SettingsRelativePath',$lookup->('SettingsRelativePath','usersettings', ''));
148             $self->{usersettings}->set ('DataType', $lookup->('DataType', 'usersettings', 'RTF'));
149             $self->{usersettings}->set ('SettingsName', $lookup->('SettingsName', 'usersettings', ''));
150             $self->{usersettings}->set ('TargetDefaultFont', $lookup->('TargetDefaultFont', 'usersettings', ''));
151            
152             return $self;
153             }
154            
155             =head2 load()
156            
157             The C function loads an existing TTX. Said file will remember where it came from, so you don't have to give the
158             filename again when you write it (assuming you write it, of course).
159            
160             TRADOS is nice enough to provide us with TTX that is illegal XML sometimes, so load() has to load your entire file into memory to
161             sanitize it of illegal characters before the XML parser sees it. This will unfortunately cause File::TTX to work from a different input
162             from TRADOS native tools, but as long as your TTX isn't generated from a Word document with soft hyphens in it, you ought to be OK.
163            
164             =cut
165            
166             sub load {
167             my ($class, $file) = @_;
168             my $xml = XML::Snap->load($file);
169             $xml->bless_text;
170             return $class->new(xml => $xml, file=>$file);
171             }
172            
173             =head1 FILE MANIPULATION
174            
175             =head2 write($file)
176            
177             Writes a TTX out to disk; the C<$file> can be omitted if you used C to make the object and you want the file to write
178             to the same place.
179            
180             =cut
181            
182             sub write {
183             my ($self, $fname) = @_;
184             $fname = $self->{file} unless $fname;
185            
186             my $file;
187             open $file, ">:raw:encoding(UCS-2LE):crlf:utf8", $fname or croak $!;
188             print $file "\x{FEFF}"; # This is the byte order marker; Perl would do this for us, apparently, if we hadn't
189             # explicitly specified the UCS-2LE encoding.
190             print $file "\n";
191             $self->{xml}->writestream($file);
192            
193             #$self->{xml}->write_UCS2LE($file);
194             }
195            
196            
197            
198             =head1 HEADER ACCESS
199            
200             Here are a bunch of functions to access and/or modify different things in the header. Pass any of them a value to set that
201             value.
202            
203             =head2 CreationTool(), CreationDate(), CreationToolVersion()
204            
205             These are in the ToolSettings part of the header. Mostly you don't care about them.
206            
207             =cut
208            
209             sub CreationTool { $_[0]->{toolsettings}->set ('CreationTool', $_[1]) }
210             sub CreationDate { $_[0]->{toolsettings}->set ('CreationDate', $_[1]) }
211             sub CreationToolVersion { $_[0]->{toolsettings}->set ('CreationToolVersion', $_[1]) }
212            
213             =head2 SourceDocumentPath(), OEncoding(), TargetLanguage(), PlugInInfo(), SourceLanguage(), SettingsPath(), SettingsRelativePath(), DataType(), SettingsName(), TargetDefaultFont()
214            
215             These are in the UserSettings part of the header. Frankly, mostly you don't care about these either, but here we're getting
216             into the reason for this module, like writing a quick script to read or change the source and target languages of TTX files.
217            
218             =cut
219            
220             sub SourceDocumentPath { $_[0]->{usersettings}->set ('SourceDocumentPath', $_[1]) }
221             sub OEncoding { $_[0]->{usersettings}->set ('O-Encoding', $_[1]) }
222             sub TargetLanguage { $_[0]->{usersettings}->set ('TargetLanguage', $_[1]) }
223             sub PlugInInfo { $_[0]->{usersettings}->set ('PlugInInfo', $_[1]) }
224             sub SourceLanguage { $_[0]->{usersettings}->set ('SourceLanguage', $_[1]) }
225             sub SettingsPath { $_[0]->{usersettings}->set ('SettingsPath', $_[1]) }
226             sub SettingsRelativePath { $_[0]->{usersettings}->set ('SettingsRelativePath', $_[1]) }
227             sub DataType { $_[0]->{usersettings}->set ('DataType', $_[1]) }
228             sub SettingsName { $_[0]->{usersettings}->set ('SettingsName', $_[1]) }
229             sub TargetDefaultFont { $_[0]->{usersettings}->set ('TargetDefaultFont', $_[1]) }
230            
231             =head2 copy_header ($source)
232            
233             Copies the header information from another TTX into this one.
234            
235             =cut
236             sub copy_header {
237             my ($self, $source) = @_;
238            
239             $self->CreationTool ($source->CreationTool);
240             $self->CreationDate ($source->CreationDate);
241             $self->CreationToolVersion ($source->CreationToolVersion);
242            
243             $self->SourceDocumentPath ($source->SourceDocumentPath);
244             $self->OEncoding ($source->OEncoding);
245             $self->TargetLanguage ($source->TargetLanguage);
246             $self->PlugInInfo ($source->PlugInInfo);
247             $self->SourceLanguage ($source->SourceLanguage);
248             $self->SettingsPath ($source->SettingsPath);
249             $self->SettingsRelativePath ($source->SettingsRelativePath);
250             $self->DataType ($source->DataType);
251             $self->SettingsName ($source->SettingsName);
252             $self->TargetDefaultFont ($source->TargetDefaultFont);
253             }
254            
255             =head2 slang(), tlang()
256            
257             These are quicker versions of SourceLanguage and TargetLanguage; they cache the values for repeated use (and they do get used
258             repeatedly). The drawback is they're actually slower for files without a source or target language defined, but this actually
259             doesn't happen all that often. At least I hope not.
260            
261             =cut
262            
263             sub slang {
264             my ($self, $l) = @_;
265             if (defined $l) {
266             $self->{slang} = $self->SourceLanguage($l);
267             return $self->{slang};
268             }
269             return $self->{slang} if $self->{slang};
270             $self->{slang} = $self->SourceLanguage();
271             $self->{slang};
272             }
273             sub tlang {
274             my ($self, $l) = @_;
275             if (defined $l) {
276             $self->{tlang} = $self->TargetLanguage($l);
277             return $self->{tlang};
278             }
279             return $self->{tlang} if $self->{tlang};
280             $self->{tlang} = $self->TargetLanguage();
281             $self->{tlang};
282             }
283            
284             =head1 WRITING TO THE BODY
285            
286             =head2 append_text($string)
287            
288             Append a string to the end of the body. It's the caller's responsibility to terminate the line.
289            
290             =cut
291            
292             sub append_text {
293             my ($self, $str) = @_;
294             $self->{body}->add (\$str);
295             }
296            
297             =head2 append_segment($source, $target, $match, $slang, $tlang, $origin)
298            
299             Appends a segment to the body. Only C<$source> and C<$target> are required; C<$match> defaults to 0, and defaults for C<$slang>
300             and C<$tlang> (the source and target languages) default to the master values in the header. Note that TagEditor I doesn't
301             like you to mix languages, but who am I to stand in your way in this matter? Finally, C<$origin> defaults to unspecified.
302             TagEditor sets it to "manual"; probably "Align" is another value, but I haven't verified that.
303            
304             If the header doesn't actually have a source or target language, and you specify one or the other here, it will be written to
305             the header as the default source or target language.
306            
307             =cut
308            
309             sub append_segment {
310             my ($self, $source, $target, $match, $slang, $tlang, $origin) = @_;
311            
312             $match = 0 unless $match;
313            
314             if ($slang) {
315             my $lang = $self->slang;
316             $self->slang($slang) unless $lang;
317             } else {
318             $slang = $self->slang;
319             }
320             if ($tlang) {
321             my $lang = $self->tlang;
322             $self->tlang($tlang) unless $lang;
323             } else {
324             $tlang = $self->tlang;
325             }
326            
327             $source = XML::Snap->escape ($source);
328             $target = XML::Snap->escape ($target);
329             my $tu = XML::Snap->parse ("");
330             $tu->set ('origin', $origin) if defined $origin;
331             $tu->append (XML::Snap->parse ("$source"));
332             $tu->append (XML::Snap->parse ("$target"));
333            
334             $self->{body}->add ($tu);
335             }
336            
337             =head2 append_mark($string, $tag)
338            
339             Appends a non-opening, non-closing tag to the body. (External style, e.g. text in Word that doesn't get translated.)
340             This is useful for setting marks for script coordination, which is why I call it append_mark.
341            
342             The default appearance is "text", but you can add C<$tag> if you want something else.
343            
344             =cut
345            
346             sub append_mark {
347             my ($self, $text, $tag) = @_;
348             $tag = 'text' unless $tag;
349             $text = XML::Snap->escape($text);
350             my $mark = XML::Snap->parse ("$text");
351             $self->{body}->add($mark);
352             }
353            
354             =head2 append_open_tag($string, $tag), append_close_tag ($string, $tag)
355            
356             Appends a opening or closing tag. Here, the C<$tag> is required. (Well, it will default to 'cf' if you screw up. But don't.)
357            
358             =cut
359            
360             sub append_open_tag {
361             my ($self, $text, $tag) = @_;
362             $tag = 'cf' unless $tag;
363             $text = XML::Snap->escape($text);
364             my $mark = XML::Snap->parse ("$text");
365             $self->{body}->add($mark);
366             }
367             sub append_close_tag {
368             my ($self, $text, $tag) = @_;
369             $tag = '/cf' unless $tag;
370             $text = XML::Snap->escape($text);
371             my $mark = XML::Snap->parse ("$text");
372             $self->{body}->add($mark);
373             }
374            
375             =head2 append_copy, copy_all
376            
377             If you have an XML piece from another TTX, you can append a copy of it directly into this TTX. Note that the "XML piece" from C and
378             C of a segment may actually be a list (because a segment may contain tags and text).
379             The C method copies the contents of another TTX's body tag into the current TTX, and can filter along the way.
380            
381             =cut
382            
383             sub append_copy {
384             my $self = shift;
385             foreach my $piece (@_) {
386             $self->{body}->add($piece); # This adds a copy of the piece if it's an XML node
387             }
388             }
389            
390             sub copy_all {
391             my $self = shift;
392             my $other = shift;
393             $self->{body}->copy_from($other->{body}, @_);
394             }
395            
396             =head1 READING FROM THE BODY
397            
398             Since a TTX is structured data, not just text, reading from it consists of iterating across its child elements. These elements
399             are L elements due to the underlying XML nature of the TTX file. I suppose some convenience functions might be a
400             good idea, but frankly it's so easy to use the XML::Snap functions (well, I did write XML::Snap) that I haven't needed any
401             so far. This might be a place to watch for further details.
402            
403             =head2 content_elements()
404            
405             Returns all the top-level content elements in a list. Depending on the structure of the TTX and the tool used to build it,
406             this level may not include all segments (I've had segmented TTX with the segments embedded in top-level formatting elements).
407            
408             =cut
409             sub content_elements {
410             my ($self) = @_;
411             my @returns = $self->{body}->children;
412             foreach (@returns) {
413             File::TTX::Content->rebless($_);
414             }
415             @returns;
416             }
417            
418             =head2 segments()
419            
420             Returns a list of just the segments in the body. Useful for data extraction.
421            
422             =cut
423            
424             sub segments {
425             my $self = shift;
426             my @returns = $self->{body}->all('Tu');
427             foreach (@returns) {
428             File::TTX::Content->rebless($_);
429             }
430             @returns;
431             }
432            
433            
434             =head1 MISCELLANEOUS STUFF
435            
436             =head2 date_now()
437            
438             Formats the current time the way TTX likes it.
439            
440             =cut
441            
442             sub date_now { strftime ('%Y%m%dT%H%M%SZ', localtime); }
443            
444            
445             =head1 File::TTX::Content
446            
447             This helper class wraps the L parts returned by C, providing a little more comfort when working
448             with them.
449            
450             =cut
451            
452             package File::TTX::Content;
453            
454             use base qw(XML::Snap);
455             use warnings;
456             use strict;
457            
458             =head2 rebless($xml)
459            
460             Called on an XML::Snap element to rebless it as a File::TTX::Content element. This is a class method.
461            
462             =cut
463            
464             sub rebless {
465             my ($class, $xml) = @_;
466             bless $xml, $class;
467             }
468            
469             =head2 type()
470            
471             Returns the type of content piece. The possible answers are 'text', 'open', 'close', 'segment', and 'mark'.
472            
473             =cut
474            
475             sub type {
476             my $self = shift;
477            
478             return 'text' if $self->istext;
479             return 'segment' if $self->is('Tu');
480             if ($self->is('ut')) {
481             return 'open' if $self->get('Type', '') eq 'start';
482             return 'close' if $self->get('Type', '') eq 'end';
483             return 'mark';
484             }
485             return 'unknown';
486             }
487            
488             =head2 tag()
489            
490             Returns (or sets) the tag or mark text of a tag or mark.
491            
492             =cut
493            
494             sub tag {
495             my $self = shift;
496             my $type = $self->type;
497             return '' if $type eq 'text';
498             return '' if $type eq 'segment';
499             return $self->set("DisplayText", shift);
500             }
501            
502             =head2 translated(), translated_xml()
503            
504             Returns the translated content of a segment, or just the content for anything else. Use with care. The C<_xml> variant returns the underlying
505             XML object - use with even more care.
506            
507             =cut
508            
509             sub translated_xml {
510             my $self = shift;
511             my $type = $self->type;
512             return $self unless $type eq 'segment';
513             my @t = $self->elements();
514             return $t[1]->children if defined $t[1];
515             return $t[0]->children;
516             }
517             sub translated {
518             my $self = shift;
519             my $type = $self->type;
520             return $self->rawcontent unless $type eq 'segment';
521             my @t = $self->elements();
522             return $t[1]->rawcontent if defined $t[1];
523             return $t[0]->rawcontent;
524             }
525            
526             =head2 write_translated($thing)
527            
528             If not called on a segment, does nothing at all. Eventually, of course, it will have to be possible to identify a text area and segment it,
529             but this is not that function.
530            
531             If called on a segment with a string, deletes whatever may be in the segment's translated half, creates an XML::Snap text object from the string,
532             and inserts said object. If called on a segment with an XML::Snap object, insert it. If called with a list of things, inserts one after the
533             other with the same rules.
534            
535             =cut
536            
537             sub write_translated {
538             my $self = shift;
539             my $type = $self->type;
540             return unless $type eq 'segment';
541             my @t = $self->elements();
542             return unless defined $t[1]; # Not sure if this can actually happen, but it's best to play it safe.
543             my $t = $t[1];
544             $$t{children} = []; # Cheating a little here, because I know this is an XML::Snap object underneath.
545             for my $element (@_) {
546             $t->add($element);
547             }
548             }
549            
550             =head2 source(), source_xml()
551            
552             Returns the source content of a segment, or just the content for anything else. The C<_xml> variant returns the xml object, so you get the tag
553             structure if it's a complex source segment.
554            
555             =cut
556            
557             sub source_xml {
558             my $self = shift;
559             my $type = $self->type;
560             return $self unless $type eq 'segment';
561             $self->first('Tuv')->children;
562             }
563             sub source {
564             my $self = shift;
565             my $type = $self->type;
566             return $self->rawcontent unless $type eq 'segment';
567             my $t = $self->first('Tuv');
568             return $t->rawcontent;
569             }
570            
571             =head2 write_source($thing)
572            
573             Works I, except on the source, which Trados tools won't let you do. Use with care.
574            
575             =cut
576            
577             sub write_source {
578             my $self = shift;
579             my $type = $self->type;
580             return unless $type eq 'segment';
581             my @t = $self->elements();
582             return unless defined $t[0]; # Not sure if this can actually happen, but it's best to play it safe.
583             my $t = $t[0];
584             $$t{children} = []; # Cheating a little here, because I know this is an XML::Snap object underneath.
585             for my $element (@_) {
586             $t->add($element);
587             }
588             }
589            
590             =head2 match()
591            
592             Returns and/or sets the recorded match percent of a segment (or 0 if it's not a segment).
593            
594             =cut
595            
596             sub match {
597             my $self = shift;
598             my $type = $self->type;
599             return 0 unless $type eq 'segment';
600             $self->set('MatchPercent', shift);
601             }
602            
603             =head2 source_lang(), translated_lang()
604            
605             Returns and/or sets the source or target language of a segment (or nothing if it's not a segment).
606            
607             =cut
608            
609             sub source_lang {
610             my $self = shift;
611             return unless $self->type eq 'segment';
612             my $xml = $self->search_first('Tuv');
613             $xml->set('Lang', shift) if $xml;
614             }
615             sub translated_lang {
616             my $self = shift;
617             return unless $self->type eq 'segment';
618             my @t = $self->elements();
619             my $xml = $t[1] if defined $t[1];
620             $xml->set('Lang', shift) if $xml;
621             }
622            
623             =head2 Other things we'll want
624            
625             The XML::Snap doesn't support the full range of XML manipulation in its current incarnation, so I'll need to revisit it, and
626             also I don't need all this functionality today, but here's what the content handler should be able to do:
627            
628             - Segment non-segmented text, replacing a chunk or series of chunks (in case neighboring text chunks don't cover a full segment)
629             with a segment or a segment-plus-extra-text.
630             - Translate a segment, i.e. replace the translated content.
631             - Modify the source of a segment (just in case).
632            
633             If you are actually using Perl to access TTX files and would like to do these things, then by all means drop me a line and tell me
634             to get the lead out.
635            
636             =head1 AUTHOR
637            
638             Michael Roberts, C<< >>
639            
640             =head1 BUGS
641            
642             Please report any bugs or feature requests to C, or through
643             the web interface at L. I will be notified, and then you'll
644             automatically be notified of progress on your bug as I make changes.
645            
646            
647            
648            
649             =head1 SUPPORT
650            
651             You can find documentation for this module with the perldoc command.
652            
653             perldoc File::TTX
654            
655            
656             You can also look for information at:
657            
658             =over 4
659            
660             =item * RT: CPAN's request tracker
661            
662             L
663            
664             =item * AnnoCPAN: Annotated CPAN documentation
665            
666             L
667            
668             =item * CPAN Ratings
669            
670             L
671            
672             =item * Search CPAN
673            
674             L
675            
676             =back
677            
678            
679             =head1 ACKNOWLEDGEMENTS
680            
681            
682             =head1 LICENSE AND COPYRIGHT
683            
684             Copyright 2010 Michael Roberts.
685            
686             This program is free software; you can redistribute it and/or modify it
687             under the terms of either: the GNU General Public License as published
688             by the Free Software Foundation; or the Artistic License.
689            
690             See http://dev.perl.org/licenses/ for more information.
691            
692            
693             =cut
694            
695             1; # End of File::TTX