File Coverage

blib/lib/Lingua/StarDict/Writer.pm
Criterion Covered Total %
statement 89 89 100.0
branch 6 8 75.0
condition n/a
subroutine 14 14 100.0
pod 2 3 66.6
total 111 114 97.3


line stmt bran cond sub pod time code
1             package Lingua::StarDict::Writer;
2              
3 2     2   77776 use 5.008;
  2         7  
4 2     2   12 use strict;
  2         4  
  2         51  
5 2     2   10 use warnings;
  2         4  
  2         95  
6 2     2   1840 use Path::Tiny;
  2         27238  
  2         157  
7 2     2   1404 use Unicode::UTF8;
  2         1306  
  2         100  
8 2     2   1279 use Time::Piece;
  2         22327  
  2         18  
9 2     2   1653 use Moo;
  2         25708  
  2         18  
10              
11              
12 2     2   4643 use Lingua::StarDict::Writer::Entry;
  2         10  
  2         890  
13              
14             =encoding utf8
15             =cut
16              
17             =head1 NAME
18              
19             Lingua::StarDict::Writer - A module that allows to create a StarDict dictionary
20              
21             =head1 VERSION
22              
23             Version 0.01
24              
25             =cut
26              
27             our $VERSION = '0.02';
28              
29              
30             =head1 SYNOPSIS
31              
32             A module that allows to create a StarDict-compatible dictionary, with multipart
33             and multitype entries.
34              
35             use Lingua::StarDict::Writer;
36              
37             my $stardict_writer = StarDict::Writer->new(name=>'My Cool Dictionary', date=>"2020-12-31");
38              
39             $stardict_writer->entry('42')->add_part(type=> "t", data => "ˈfɔɹti tuː");
40             $stardict_writer->entry('42')->add_part(type=> "m", data => "Answer to the Ultimate Question of Life, the Universe, and Everything");
41              
42             $stardict_writer->entry('Perl')->add_part(type=> "t", data => "pɛʁl");
43             $stardict_writer->entry('Perl')->add_part(type=> "h", data => "The best programming language ever");
44              
45             $stardict_writer->write;
46              
47             =head1 DESCRIPTION
48              
49             StarDict is a popular dictionary format, supported by many dictionary and book reading programs.
50              
51             StarDict entry may consist of several parts of various text or media types.
52              
53             This module allows to create a new StarDict dictionary with entries consisting of parts of
54             arbitrary types.
55              
56             =head1 METHODS
57              
58             =head2 new ( option_name => 'value')
59              
60             Constructs and returns a new C object. This object will accept parts for the
61             dictionary entry via C method. You can write the resulting dictionary with C method.
62             C method accepts arguments represented as C<< name=>value >> options hash. Following options are available:
63              
64             =over
65              
66             =item * C - sets a name for the dictionary. It will be specified in StarDict dictionary C<.ifo> file as
67             dictionary name. When you call C method, writer will create C dir in the C dir, and all
68             dictionary files that will be written there will use C as the base part of file name.
69             By default, the name will be set to C<"Some Dictionary written by Lingua::StarDict::Writer"> if none is provided.
70              
71             =item * C - date of dictionary creation in C format. Will be saved in Stardict C<.ifo> file.
72             By default, current date will be used.
73              
74             =item * C - path where dictionary files will be saved. By default, current dir will be used as the C.
75              
76             =back
77              
78             =head2 entry($entry_title)
79              
80             Returns dictionary entry named C<$entry_title>. If entry C<$entry_title> does not exist, a new empty
81             dictionary entry will be created and returned. The only reason you may want to get a dictionary entry is to
82             add a new part using C method (See below)
83              
84             Entries can be added in arbitrary order, they will be sorted alphabetically using StarDict sorting algorithm, when
85             dictionary is written.
86              
87             =head2 entry->($enry_title)->add_part(type => $part_type, data => $part_data)
88              
89             Adds new part to an entry.
90              
91             =over
92              
93             =item * C - part type, coded as one Latin letter as specified in StarDictFileFormat. (C<'m'> for plaintext, C<'h'> for html,
94             C<'t'> for pronunciation, etc. See StarDictFileFormat in "See Also" chapter for more info). By default C<'m'> type will
95             be used if none is specified.
96              
97             =item * C - Content of added entry part: a text string that can be formatted using chosen C markup.
98              
99             =back
100              
101             Parts will be saved in the entry in the order they were added.
102              
103             =head2 write
104              
105             This method will write all entries to the disk formatted as StarDict dictionary. C<.dict>, C<.idx> and C<.ifo> files
106             will be placed in directory C at the path specified in C option. You should put them to C or
107             C<~/.stardict/dic> path to make them visible to StarDict.
108              
109             =head1 ENCODING ISSUE
110              
111             All methods expect to recieve data encoded as perl character strings, not as byte string (i.e. Cyrillic "я" should be encoded as C<\x{44f}>,
112             and not as C<\x{d1}\x{8f}>). If you have read utf-8 source data from a file, database or from web, make sure that utf-8 bytes you've got
113             are converted to perl characters. See L for more info.
114              
115              
116             =head1 CAVEATS
117              
118             =over
119              
120             =item * C mode is not implemented. Use custom sequence mode instead.
121              
122             =item * Support for binary parts is not implemented.
123              
124             =item * Dictionary compression is not implemented.
125              
126             =item * Synonyms are not implemented.
127              
128             =back
129              
130             etc...
131              
132             =head1 SEE ALSO
133              
134             =over
135              
136             =item * L - StarDict
137             format description. A copy of this file can be found in this package in C dir.
138              
139             =item * L - another module for writing StarDict dictionaries. It supports only
140             single-part plain text entries.
141              
142             =back
143              
144             =cut
145              
146             has 'name' => (
147             is => 'rw',
148             default => "Some Dictionary written by Lingua::StarDict::Writer",
149             );
150              
151             has 'date' => (
152             is => 'rw',
153             default => sub {localtime->ymd},
154             );
155              
156             has 'output_dir' => (
157             is => 'rw',
158             default => "./",
159             );
160              
161             has '_entries' => (
162             is => 'rw',
163             default => sub {+{}}
164             );
165              
166             sub entry
167             {
168 39     39 1 29318 my $self = shift;
169 39         69 my $name = shift;
170              
171 39 100       155 if (! exists $self->_entries->{$name})
172             {
173 18         445 $self->_entries->{$name} = Lingua::StarDict::Writer::Entry->new(name=>$name); # Create new entry if does not exist
174             }
175 39         229 return $self->_entries->{$name};
176             }
177              
178             sub write
179             {
180 3     3 1 889 my $self = shift;
181 3         73 my $dir = path($self->output_dir,$self->name);
182              
183 3 50       370 unless(-d $dir)
184             {
185 3 50       208 mkdir($dir,0755) or die "Cant create directory $dir\n";
186             }
187              
188 3         412 my $dict_fh = $dir->child($self->name.".dict")->openw_raw;
189 3         769 my $idx_fh = $dir->child($self->name.".idx" )->openw_raw;
190 3         542 my $ifo_fh = $dir->child($self->name.".ifo" )->openw_utf8;
191              
192 3         15395 my $byte_count = 0;
193 3         9 my @ordered_keys = sort {stardict_strcmp($a,$b)} keys(%{$self->_entries});
  37         57  
  3         52  
194              
195 3         15 foreach my $word (@ordered_keys)
196             {
197 17         27 my $start_pos = $byte_count;
198 17         22 my $word_bytes;
199             {
200 2     2   25 use warnings FATAL => 'utf8';
  2         6  
  2         254  
  17         22  
201 17         39 $word_bytes = Unicode::UTF8::encode_utf8($word);
202             }
203 17         79 print $idx_fh pack('a*x',$word_bytes);
204 17         33 print $idx_fh pack('N',$byte_count);
205              
206 17         29 foreach my $part (@{$self->entry($word)->_parts})
  17         34  
207             {
208 19         41 my $data = $part->{data};
209 19         29 my $data_bytes;
210             {
211 2     2   16 use warnings FATAL => 'utf8';
  2         5  
  2         739  
  19         27  
212 19         35 $data_bytes = Unicode::UTF8::encode_utf8($data);
213             }
214 19         81 print $dict_fh $part->{type};
215 19         42 print $dict_fh "$data_bytes\0";
216 19         39 $byte_count += length($data_bytes) + 1 + 1; # one for media type char, one for \0
217             }
218              
219 17         42 print $idx_fh pack('N',$byte_count-$start_pos);
220             }
221              
222 3         7 my $word_count = scalar (keys %{$self->_entries});
  3         10  
223              
224 3         34 print $ifo_fh "StarDict's dict ifo file\n";
225 3         10 print $ifo_fh "version=2.4.2\n";
226 3         12 print $ifo_fh "wordcount=$word_count\n";
227 3         16 print $ifo_fh "bookname=".$self->name."\n";
228 3         26 print $ifo_fh "idxfilesize=", tell($idx_fh),"\n";
229 3         17 print $ifo_fh "date=".$self->date."\n";
230             # if($^O eq "MSWin32"){ print $ifo_fh "sametypesequence=m\n";} # do not support sametypesequence for now
231             # else { print $ifo_fh "sametypesequence=x\n";}
232              
233 3         223 close $dict_fh;
234 3         85 close $idx_fh;
235 3         181 close $ifo_fh;
236             }
237              
238              
239             # g_ascii_strcasecmp, strcmp, stardict_strcmp_old (formerly known as stardict_strcmp)
240             # are pure perl reimplementation of sort functons used by StarDirct for index lookup.
241             # Index file should be ordered with exactly the same functions that is used for lookup
242             # g_ascii_strcasecmp, strcmp, stardict_strcmp_old are left here commented out for historical
243             # reasons.
244             # stardict_strcmp is a perl-way implementation of sort functions that do same ordering
245             # as function mentioned above
246              
247             #sub g_ascii_strcasecmp
248             #{
249             # # pure perl re-implementation of g_ascii_strcasecmp
250             # my $s1 = shift;
251             # my $s2 = shift;
252             # no locale;
253             # $s1=~s/([A-Z])/lc($1)/ge;
254             # $s2=~s/([A-Z])/lc($1)/ge;
255             # while (length($s1) || length($s2))
256             # {
257             # return -1 if length($s1)==0;
258             # return 1 if length($s2)==0;
259             # $s1=~s/^(.)//;
260             # my $c1 = $1;
261             # $s2=~s/^(.)//;
262             # my $c2 = $1;
263             # return ord($c1)-ord($c2) if $c1 ne $c2;
264             # }
265             # return 0;
266             #}
267              
268              
269             #sub strcmp
270             #{
271             # # pure perl re-implementation of strcmp
272             # my $s1 = shift;
273             # my $s2 = shift;
274             # no locale;
275             # while (length($s1) || length($s2))
276             # {
277             # return -1 if length($s1)==0;
278             # return 1 if length($s2)==0;
279             # $s1=~s/^(.)//;
280             # my $c1 = $1;
281             # $s2=~s/^(.)//;
282             # my $c2 = $1;
283             # return ord($c1)-ord($c2) if $c1 ne $c2;
284             # }
285             # return 0;
286             #}
287              
288             #sub stardict_strcmp_old
289             #{
290             # # pure perl re-implementation of stardict_strcmp
291             # my $s1 = shift;
292             # my $s2 = shift;
293             #
294             # my $i = g_ascii_strcasecmp($s1, $s2);
295             # return $i if $i;
296             # return strcmp($s1,$s2);
297             #}
298              
299              
300              
301             # StarDict expects index file to be sorted in a specific way.
302             # UTF-8 strings are treated as bytes, all latin (and only latin) characters
303             # are taken to lower case, and then strings are compared. If strings copmared
304             # that way found equal, they are compared again, now without converting latin
305             # letters to lower case.
306              
307             # For more info see doc/StarDictFileFormat or StarDict code. You should look for
308             # strcmp and stardict_strcmp functions there, and for g_ascii_strcasecmp in glibc.
309              
310             sub stardict_strcmp
311             {
312 37     37 0 53 my $s1 = shift;
313 37         51 my $s2 = shift;
314              
315 37         52 my $s1_bytes;
316             my $s2_bytes;
317             {
318 2     2   18 use warnings FATAL => 'utf8';
  2         4  
  2         402  
  37         48  
319 37         72 $s1_bytes = Unicode::UTF8::encode_utf8($s1); # Convert sting from unicde characters to bytes
320 37         60 $s2_bytes = Unicode::UTF8::encode_utf8($s2);
321             }
322              
323 37         52 my $s1_lc_bytes = $s1_bytes;
324 37         55 $s1_lc_bytes =~ tr/A-Z/a-z/; # do lower case
325 37         53 my $s2_lc_bytes = $s2_bytes;
326 37         51 $s2_lc_bytes =~ tr/A-Z/a-z/;
327              
328 37         52 my $res = $s1_lc_bytes cmp $s2_lc_bytes; # Compare lower case string represented as bytes
329              
330 37 100       69 $res = $s1_bytes cmp $s2_bytes unless $res; # if equal, compare unlowercased string represented as bytes;
331              
332 37         73 return $res;
333             }
334              
335              
336             =head1 AUTHOR
337              
338             Nikolay Shaplov, C<< >>
339              
340              
341             =head1 BUGS
342              
343             Please report any bugs or feature requests through
344             the web interface at L
345              
346             =head1 SUPPORT
347              
348             You can find documentation for this module with the perldoc command.
349              
350             perldoc Lingua::StarDict::Writer
351              
352              
353             You can also look for information at:
354              
355             =over 4
356              
357             =item * The project's bug tracker (report bugs and request features here)
358              
359             L
360              
361             =item * AnnoCPAN: Annotated CPAN documentation
362              
363             L
364              
365             =item * CPAN Ratings
366              
367             L
368              
369             =item * Search CPAN
370              
371             L
372              
373             =back
374              
375              
376             =head1 ACKNOWLEDGEMENTS
377              
378             Special thanks to B from C C<#perl> for deep code review.
379              
380             Thanks to B for proofreading.
381              
382             =head1 LICENSE AND COPYRIGHT
383              
384             Copyright 2021 Nikolay Shaplov.
385              
386             This program is free software; you can redistribute it and/or modify it
387             under the terms of the the Artistic License (2.0). You may obtain a
388             copy of the full license at:
389              
390             L
391              
392             Any use, modification, and distribution of the Standard or Modified
393             Versions is governed by this Artistic License. By using, modifying or
394             distributing the Package, you accept this license. Do not use, modify,
395             or distribute the Package, if you do not accept this license.
396              
397             If your Modified Version has been derived from a Modified Version made
398             by someone other than you, you are nevertheless required to ensure that
399             your Modified Version complies with the requirements of this license.
400              
401             This license does not grant you the right to use any trademark, service
402             mark, tradename, or logo of the Copyright Holder.
403              
404             This license includes the non-exclusive, worldwide, free-of-charge
405             patent license to make, have made, use, offer to sell, sell, import and
406             otherwise transfer the Package with respect to any patent claims
407             licensable by the Copyright Holder that are necessarily infringed by the
408             Package. If you institute patent litigation (including a cross-claim or
409             counterclaim) against any party alleging that the Package constitutes
410             direct or contributory patent infringement, then this Artistic License
411             to you shall terminate on the date that such litigation is filed.
412              
413             Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER
414             AND CONTRIBUTORS "AS IS' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES.
415             THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
416             PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY
417             YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR
418             CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR
419             CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE,
420             EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
421              
422             =cut
423              
424             1; # End of Lingua::StarDict::Writer