File Coverage

blib/lib/Statistics/Sequences/Vnomes.pm
Criterion Covered Total %
statement 22 24 91.6
branch n/a
condition n/a
subroutine 8 8 100.0
pod n/a
total 30 32 93.7


line stmt bran cond sub pod time code
1             package Statistics::Sequences::Vnomes;
2              
3 1     1   22171 use 5.008008;
  1         6  
  1         44  
4 1     1   6 use strict;
  1         2  
  1         34  
5 1     1   5 use warnings;
  1         7  
  1         32  
6 1     1   6 use Carp 'croak';
  1         2  
  1         83  
7 1     1   15 use vars qw($VERSION @ISA);
  1         1  
  1         62  
8 1     1   1033 use Algorithm::Combinatorics qw(variations_with_repetition);
  1         4501  
  1         83  
9 1     1   815 use Math::Cephes 0.43;
  1         27043  
  1         74  
10 1     1   580 use Statistics::Sequences 0.11;
  0            
  0            
11             use Statistics::Lite qw(sum);
12             @ISA = qw(Statistics::Sequences);
13              
14             $VERSION = '0.11';
15              
16             =pod
17              
18             =head1 NAME
19              
20             Statistics::Sequences::Vnomes - The Serial Test (psi-square) and Generalized Serial Test (delta psi-square) for equiprobability of v-nomes (or I<v>-plets/bits) (Good's and Kendall-Babington Smith's tests)
21              
22             =head1 SYNOPSIS
23              
24             use Statistics::Sequences::Vnomes 0.11;
25             my $vnomes = Statistics::Sequences::Vnomes->new();
26             $vnomes->load(qw/1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1/); # data treated categorically - could also be, e.g., 'a' and 'b'
27             my $r_freq_href = $vnomes->observed(length => 3); # returns hashref of frequency distribution for trinomes in the sequence
28             my @freq = $vnomes->observed(length => 3); # returns only the observed trinome frequencies (not keyed by the trinomes themselves)
29             $val = $vnomes->expected(length => 3); # mean chance expectation for the frequencies (2.5)
30             $val = $vnomes->psisq(length => 3); # Good's "second backward differences" psi-square (3.4); option 'delta' gives alternative estimates
31             $val = $vnomes->p_value(length => 3, tails => 1); # 1-tailed p-value for psi-square (0.0913)
32             $val = $vnomes->z_value(length => 3, tails => 1, ccorr => 1); # inverse-phi of the 1-tailed p-value (1.333)
33             my $href = $vnomes->stats_hash(length => 3, values => {psisq => 1, p_value => 1}, tails => 1); # include any stat-method (& their options)
34             $vnomes->dump(length => 3,
35             values => {observed => 1, expected => 1, psisq => 1, p_value => 1}, # what stats to show (or not if => 0)
36             format => 'table', flag => 1, precision_s => 3, precision_p => 7, verbose => 1, tails => 1);
37             # prints:
38             # Vnomes (3) statistics
39             #.-----------+-----------+-----------+-----------.
40             #| observed | expected | p_value | psisq |
41             #| | | | |
42             #+-----------+-----------+-----------+-----------+
43             #| '011' = 4 | 2.500 | 0.0913418 | 3.400 |
44             #| '010' = 1 | | | |
45             #| '111' = 2 | | | |
46             #| '000' = 1 | | | |
47             #| '101' = 2 | | | |
48             #| '001' = 3 | | | |
49             #| '100' = 3 | | | |
50             #| '110' = 4 | | | |
51             #'-----------+-----------+-----------+-----------'
52             #psisq is Good's delta^2-psi^2, calculated with second backward differences, and has 2 degrees of freedom.
53             #p-value is 1-tailed.
54             # these and other methods inherited from Statistics::Sequences and its parent, Statistics::Data:
55             $vnomes->dump_data(delim => ','); # comma-separated single line of the loaded data
56             $vnomes->save(path => 'seq.csv'); # serialized for retrieval by open() method
57              
58             =head1 DESCRIPTION
59              
60             Implements tests of the independence of successive elements of a sequence of data: serial tests for v-nomes (a.k.a I<v>-plets or, for binary data, I<v>-bits) - singlets/monobits, dinomes/doublets, trinomes/triplets, etc.. Test are of variations in sub-sequence length I<v> that are equally likely for the sampled sequence. For example, a sequence sampled from a "heads'n'tails" (H and T) distribution can be tested for its equal representation of the trinomes HTH, HTT, TTT, THT, and so on. Counting up these v-nomes at all points in the sequence, permitting overlaps, yields a statistic - psi-square - that is approximately distributed as I<chi>-square; the Kendall-Babington Smith statistic.
61              
62             However, because these counts are not independent (given the overlaps), Good's Generalized Serial Test is the default test-statistic returned by this module's L<test|test> routine: It computes psi-square by differencing, viz., in relation to not only the specified B<length>, or value of I<v>, but also its value for the first two prior lengths of I<v>, yielding a statistic, delta-square-psi-square (the "second backward difference" measure) that is I<exactly> distributed as I<chi>-square.
63              
64             The test is suitable for multi-state data, not only the binary, dichotomous sequence suitable for the L<runs|Statistics::Sequences::Runs> or L<joins|Statistics::Sequences::Joins> tests. It can also be used to test that the individual elements in the list are uniformly distributed, that the states are equally represented, i.e., as a I<chi>-square-based frequency test (a.k.a. test of uniformity, equiprobability, equidistribution). This is done by giving a B<length> of 1, i.e., testing for mononomes.
65              
66             Note that this is I<not> the so-called serial test described by Knuth (1998, Ch. 2), which involves non-overlapping pairs of sequences.
67              
68             =head1 METHODS
69              
70             Methods include those described in L<Statistics::Sequences>, which can be used directly from this module, as follows.
71              
72             =head2 new
73              
74             $vnomes = Statistics::Sequences::Vnomes->new();
75              
76             Returns a new Vnomes object. Expects/accepts no arguments but the classname.
77              
78             =head2 load
79              
80             $vnomes->load(@data); # anonymously
81             $vnomes->load(\@data);
82             $vnomes->load('sample1' => \@data); # labelled whatever
83              
84             Loads data anonymously or by name - see L<load|Statistics::Data/load, add, unload> in L<Statistics::Data> for ways data can be loaded and retrieved (more than shown here). Every load unloads all previous loads and any additions to them.
85              
86             Data for this test of sequences can be categorical or numerical, all being treated categorically. Also, the data do not have to be dichotomous (unlike in tests of L<runs|Statistics::Sequences::Runs> and L<joins|Statistics::Sequences::Joins>.
87              
88             =head2 add, read, unload
89              
90             See L<Statistics::Data> for these and other operations on loaded data.
91              
92             =head2 observed, vnomes_observed, vno
93              
94             $href = $vnomes->observed(length => int, circularize => 1|0); # returns keyed distribution; assumes data have already been loaded
95             @ari = $vnomes->observed(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => int, circularize => 1|0); # returns frequencies only
96              
97             Returns the frequency distribution for the I<observed> number of v-nomes of the given B<length> in a sequence. These are counted up as overlapping. Called in array context, returns an array of the counts for each possible pattern; otherwise, where these are keyed by the pattern in a hash reference. For descriptives of the observed frequencies, try calling L<Statistics::Lite|Statistics::Lite> methods with this array.
98              
99             A value for B<length> greater than zero is required, and must be no more than the sample-size. So for the sequence
100              
101             1 0 0 0 1 0 0 1 0 1 1 0
102              
103             there are 12 v-nomes of length 1 (mononomes, the number of elements) from the two (0 and 1) that are possible; 11 dinomes from the four (10, 00, 00, 01) that are possible; and 10 trinomes from eight (100, 000, 001, 010, etc.) that are possible.
104              
105             By default, the sequence is counted up for v-nomes as a cyclic sequence, treating the first element of the sequence as following the last one. So the count loops to the beginning of the sequence until all elements from the end are included, and instead of ending the count for trinomes in this sequence at '110', the count includes '101' and '010', increasing the observed sum of trinomes to 12. Set L<circularize|circularize> => 0 if the count is not to be made cyclically.
106              
107             The data to test can already have been L<load|Statistics::Sequences/load, add, unload>ed, or you send it directly keyed as B<data>.
108              
109             In the code for this and/or other methods, following the work of Good (1953, 1957; Good & Gover, 1967), I<v> is used to define variables referring to the length of the subsequence (I<mono>nome, I<di>nome, I<tri>nome, etc.), I<t> defines the number of states (events, letters, etc.) in the sequence, and I<r> identifies each possible subsequence of length I<v>.
110              
111             =cut
112              
113             sub observed {
114             my $self = shift;
115             my $args = ref $_[0] ? shift : {@_};
116             my $data = ref $args->{'data'} ? $args->{'data'} : $self->read($args);
117             my $num = _get_n($self, $args);
118             my $v = $args->{'length'} || croak __PACKAGE__, '::test Must have a v-nome length greater than zero';
119             # Sanitize length in proportion to size:
120             croak __PACKAGE__, "::test Sequence length v ($v) for testing should be no more than the sample-size ($num) - 2" if $v >= $num - 1;
121             #my $lim = Math::Cephes::floor(Math::Cephes::log2($num)) - 2; # re NIST recommmendation that $v is too small for $num and $t if $v >= $lim
122             my $circularize = defined $args->{'circularize'} ? $args->{'circularize'} : 1;
123             my $states_aref = _get_stateslist($data, $args->{'states'});# Get list of unique states as given or from sequence itself
124             my $v_aref = _get_v_ari($v);
125             my ($v_i, @data_i, %psisq_v) = ();
126             foreach $v_i(@$v_aref) {# print "v_i = $v_i\n"; # loop through tests of v, v-1 and v-2 lengths ...
127             @data_i = @$data;# extend sequence by appending the first $v_i - 1 elements to its end
128             push(@data_i, @data_i[0 .. $v_i - 2]) if $circularize; # e.g., if v = 3, append elements [0] & [1]; if v = 2, append only 1st value
129             $psisq_v{$v_i} = _frequencies(\@data_i, $v_i, $states_aref);# count up frequencies of each sequence of length $v_i
130             last if $v_i == $v;
131             }
132             return wantarray ? values %{$psisq_v{$v}} : $psisq_v{$v};
133             }
134             *vnomes_observed = \&observed;
135             *vno = \&observed;
136              
137             =head2 expected, vnomes_expected, vne
138              
139             $count = $vnomes->expected(length => int, circularize => 1|0); # assumes data have already been loaded
140             $count = $vnomes->expected(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => int, circularize => 1|0, states => [0, 1]);
141              
142             Returns the I<expected> number of observations for each v-nome of the given B<length> in a sequence; i.e., the mean chance expectation, assuming that each event is generated by a random uniform process. Options are as for L<observed|Statistics::Sequences::Vnomes/observed>. This expected frequency is given by:
143              
144             =for html <p>&nbsp;&nbsp;<i>E[V]</i> = <i>N</i><i>t</i><sup>&ndash;<i>v</i></sup>
145              
146             where I<t> is the number of possible states (alternatives; as read from the data or as explicitly given as an array in B<states>), I<v> is the v-nome length, and I<N> is the length of the sequence, less I<v> + 1 if the count is not to be circularized (Good, 1953, Eq. 12).
147              
148             Another way to think of this is as the number of mononome observations in the sequence divided by the number of possible permutations of its states for the given length. So, for a sequence made up of 0s and 1s, there are four possible variations of length 2 (00, 10, 01 and 11), so that the expected frequency for each of these variations in a sequence of 20 values is 20 / 4, i.e., 5.
149              
150             =cut
151              
152             sub expected {
153             my $self = shift;
154             my $args = ref $_[0] ? shift : {@_};
155             my $num = _get_n($self, $args);
156             my $v = $args->{'length'} || croak __PACKAGE__, '::test Must have a v-nome length greater than zero';
157             # Sanitize length in proportion to size:
158             croak __PACKAGE__, "::test Sequence length v ($v) for testing should be no more than the sample-size ($num) - 2" if $v >= $num - 1; # or permit $v to extend to limit of circularized data?
159             #my $lim = Math::Cephes::floor(Math::Cephes::log2($num)) - 2; # re NIST recommmendation that $v is too small for $num and $t if $v >= $lim
160             my $circularize = defined $args->{'circularize'} ? $args->{'circularize'} : 1; # circularize by default
161             # Compute expected freq of variations of $t of length $v:
162             $num = _count_lim($num, $v) if ! $circularize;
163             my $p = $self->prob_r(length => $v);
164             return $num * $p;
165             }
166             *vnomes_expected = \&expected;
167             *vne = \&expected;
168              
169             =head2 variance
170              
171             $var = $vnomes->variance(length => int, circularize => 1|0); # assumes data have already been loaded
172             $var = $vnomes->variance(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => int, circularize => 1|0, states => [0, 1]);
173              
174             Returns the variance in the expected frequency of each v-nome, as per Good (1953, Eqs. 11 and 14).
175              
176             =cut
177              
178             sub variance {
179             #croak 'variance() method is not defined for ' . __PACKAGE__;
180             my $self = shift;
181             my $args = ref $_[0] ? shift : {@_};
182             my $num = _get_n($self, $args);
183             my $v = $args->{'length'} || croak __PACKAGE__, '::test Must have a v-nome length greater than zero';
184             # Sanitize length in proportion to size:
185             croak __PACKAGE__, "::test Sequence length v ($v) for testing should be no more than the sample-size ($num) - 2" if $v >= $num - 1; # or permit $v to extend to limit of circularized data?
186             #my $lim = Math::Cephes::floor(Math::Cephes::log2($num)) - 2; # re NIST recommmendation that $v is too small for $num and $t if $v >= $lim
187             # Compute expected freq of variations of $t of length $v:
188             my $pr = $self->prob_r(length => $v);
189             my $var = $pr;
190             my ($sum_1, $sum_2, $m, $pm) = (0, 0);
191             foreach $m(1 .. $v - 1) {
192             $pm = $self->prob_r(length => $m);
193             $sum_1 += $pm;
194             $sum_2 += ($m + $v - 1) * $pm;
195             }
196             $var += 2 * $pr * $sum_1;
197             $var -= (2 * $v - 1) * $pr**2;
198             my $circularize = defined $args->{'circularize'} ? $args->{'circularize'} : 1;
199             $var += $num**-1 * $pr * ( ( ($v - 1) * (3 * $v - 1) * $pr ) - ($v - 1) - 2 * $sum_2 ) if !$circularize;
200             return $var;
201             }
202              
203             sub obsdev {
204             croak 'obsdev() method is not defined for ' . __PACKAGE__;
205             }
206             *observed_deviation = \&obsdev;
207              
208             =head2 stdev, standard_deviation
209              
210             $var = $vnomes->stdev(length => int, circularize => 1|0); # assumes data have already been loaded
211             $var = $vnomes->stdev(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => int, circularize => 1|0, states => [0, 1]);
212              
213             Returns the standard deviation in the expected frequency of each v-nome, as per Good (1953, Eqs. 11 and 14).
214              
215             =cut
216              
217             sub stdev {
218             return sqrt(variance(@_));
219             }
220             *standard_deviation = \&stdev;
221              
222             =head2 psisq
223              
224             $vnomes->psisq(length => int, delta => '2|1|0', circularize => '1|0', states => [qw/A C G T/]);
225              
226             Performs Good's Generalized Serial Test (by default), of v-nomes on the given or named distribution, yielding a I<psi-square> statistic. The option B<delta> specifies if backward differencing.
227              
228             =over 4
229              
230             =item delta =E<gt> 0
231              
232             The raw psi-square value for sub-sequences of length I<v> (Kendall-Babington Smith statistic), i.e., without backward differencing.
233              
234             =for html <p>&nbsp;&nbsp;&nbsp; &Psi;<sup>2</sup><sub>v</sub> = &sum;{<i>r</i>} ( <i>n<sub>r</sub></i>&ndash; (<i>N</i> &ndash; <i>v</i> + 1) / <i>t<sup>v</sup></i> )<sup>2</sup> ) / ( (<i>N</i> &ndash; <i>v</i> + 1) / <i>t<sup>v</sup></i> )</p>
235              
236             for uncircularized sequences (Good, 1953, Eq. 1), and
237              
238             =for html <p>&nbsp;&nbsp;&nbsp; <span style="text-decoration:overline;">&Psi;</span><sup>2</sup><sub>v</sub> = &sum;{<i>r</i>} ( <i><span style="text-decoration:overline;">n</style><sub>r</sub></i>&ndash; <i>Nt<sup>&ndash;v</sup></i> ) / <i>Nt<sup>&ndash;v</sup></i></p>
239              
240             for circularized sequences (Good, 1953, Eq. 2), where an overline indicates circularization, and where I<N> is the length of the sequence, I<v> is the v-nome length, I<t> is the number of unique states that each element of the sequence can take, and I<r> is the number of variations of length I<v> for the given number of unique states.
241              
242             This statistic is only asymptotically distributed as I<chi>-square if I<length> (I<v>) = 1, and is therefore not used as the default.
243              
244             =item delta =E<gt> 1
245              
246             The "I<first> backward differences" of psi-square, which is the difference between the psi-square values for sub-sequences of length I<v> and length I<v> - 1.
247              
248             =for html <p>&nbsp;&nbsp;&nbsp; &Delta;&Psi;<sup>2</sup><sub>v</sub> = &Psi;<sup>2</sup><sub>v</sub> &ndash; &Psi;<sup>2</sup><sub>v&ndash;1</sub> (<i>v</i> &ge; 1)</p>
249              
250             While it is I<chi>-square distributed, counts of first-differences are not statistically independent (Good, 1953; Good & Gover, 1967), and "the sequence of second differences forms a much better set of statistics for testing the hypothesis of flat-randomness" (Good & Gover, 1967, p. 104).
251              
252             =item delta =E<gt> 2 (or undef)
253              
254             This method returns by default (without specifying a value for B<delta>), or if B<delta> is not 1 or 0, the "second backward differences" psi-square (I<delta^2-psi^2>). This incorporates psi-square values for backwardly adjacent values of I<length> (I<v>), i.e., for sub-sequences of length I<v>, I<v> - 1, and I<v> - 2.
255              
256             =for html <p>&nbsp;&nbsp;&nbsp; &Delta;<sup>2</sup>&Psi;<sup>2</sup><sub>v</sub> = &Psi;<sup>2</sup><sub>v</sub> &ndash; 2&Psi;<sup>2</sup><sub>v&ndash;1</sub> &ndash; &Psi;<sup>2</sup><sub>v&ndash;2</sub> (<i>v</i> &ge; 2)</p>
257              
258             It is not only asymptotically I<chi>-square distributed, but uses statistically independent counts of all the possible variations of sequences of the given B<length> (Good, 1953).
259              
260             =back
261              
262             See also Good's algorithm in individual papers describing application of the Serial Test (e.g., Davis & Akers, 1974).
263              
264             =cut
265              
266             sub psisq {
267             my $self = shift;
268             my $args = ref $_[0] ? shift : {@_};
269             my $data = ref $args->{'data'} ? $args->{'data'} : $self->read($args);
270             my $num = _get_n($self, $args);
271             my $v = $args->{'length'} || croak __PACKAGE__, '::test Must have a v-nome length greater than zero';
272             # Sanitize length in proportion to size:
273             croak __PACKAGE__, "::test Sequence 'length' ($v) for testing should be no more than the sample-size ($num) - 2" if $v >= $num - 1;
274             #my $lim = Math::Cephes::floor(Math::Cephes::log2($num)) - 2; # re NIST recommmendation that $v is too small for $num and $t if $v >= $lim
275             my $circularize = defined $args->{'circularize'} ? $args->{'circularize'} : 1;
276             my $delta = defined $args->{'delta'} ? $args->{'delta'} : 2;
277             my $t = _get_t($self, $args);
278             my $states_aref = _get_stateslist($data, $args->{'states'});# Get list of unique states as given or from sequence itself
279             my $v_aref = _get_v_ari($v);
280             # While looping through tests of v, v-1 and v-2 (= $v_i) sequence lengths ...
281             my ($v_i, $r_freq, @data_i, %psisq_v) = ();
282             foreach $v_i(@$v_aref) {
283             # counting vnome frequencies: handle circularization by firstly appending elements from the start to the sequence's end:
284             if ($circularize) {
285             @data_i = @$data; # copy the original sequence
286             push(@data_i, @data_i[0 .. $v_i - 2]); # e.g., if v = 3, append elements [0] & [1]; if v = 2, append only 1st value
287             $r_freq = _frequencies(\@data_i, $v_i, $states_aref); # frequencies of each sequence of length $v_i
288             $psisq_v{$v_i} = _psisq_uncirc(scalar @data_i, $t, $v_i, $r_freq);
289             }
290             else {
291             $r_freq = _frequencies($data, $v_i, $states_aref);
292             $psisq_v{$v_i} = _psisq_uncirc(scalar @$data, $t, $v_i, $r_freq);
293             }
294             }
295             my ($psisq, $df) = _sel_psisq_and_df($v, $t, \%psisq_v, $delta);
296             return wantarray ? ($psisq, $df) : $psisq;
297             }
298              
299             =head2 z_value, vnomes_zscore, vzs, zscore
300              
301             $val = $vnomes->z_value(); # data already loaded, use default windows and prob
302             $val = $vnomes->z_value(data => $aref);
303             ($zvalue, $pvalue) = $vnomes->z_value(data => $aref, tails => 2); # same but wanting an array, get the p-value too
304              
305             Returns the zscore not from a direct test of deviation but from the (inverse-phi) of the L<p_value|Statistics::Sequences::Vnomes/p_value>, using L<Math::Cephes|Math::Cephes> C<ndtri>. Called in array context, also returns the I<p>-value itself. Same options and conditions as above. Other options are B<precision_s> (for the z_value) and B<precision_p> (for the p_value).
306              
307             =cut
308              
309             sub z_value {
310             my $self = shift;
311             my $args = ref $_[0] ? shift : {@_};
312             $args->{'tails'} ||= 1;
313             my $pval = $self->p_value($args);
314             require Statistics::Zed;
315             my $zed = Statistics::Zed->new();
316             my $zval = $zed->p2z(value => $pval, %$args);
317             return wantarray ? ($zval, $pval) : $zval;
318             }
319             *vzs = \&z_value;
320             *vnomes_zscore = \&z_value;
321             *zscore = \&z_value;
322              
323             =head2 p_value, test, vnomes_test, vnt
324              
325             $p = $vnomes->p_value(); # using loaded data and default args
326             $p = $vnomes->p_value(data => [1, 0, 1, 1, 0], exact => 1); # using given data (by-passing load and read)
327             $p = $vnomes->p_value(trials => 20, observed => 10); # without using data
328              
329             Returns probability of obtaining the psisq value for data already L<load|Statistics::Sequences/load, add, unload>ed, or directly keyed as B<data>. The I<p>-value is read off the complemented chi-square distribution (incomplete gamma integral) using L<Math::Cephes|Math::Cephes> C<igamc>.
330              
331             =cut
332              
333             sub p_value {
334             my $self = shift;
335             my $args = ref $_[0] ? shift : {@_};
336             my ($psisq, $df, $p_value) = $self->psisq($args);
337             $p_value = Math::Cephes::igamc($df/2, $psisq/2);
338             $args->{'tails'} ||= 1;
339             $p_value /= 2 if $args->{'tails'} == 1;
340             return $p_value;
341             }
342             *test = \&p_value;
343             *vnomes_test = \&p_value;
344             *vnt = \&p_value;
345              
346             =head2 dump
347              
348             $vnomes->dump(length => 3, values => {psisq => 1, p_value => 1}, format => 'table|labline|csv', flag => 1, precision_s => 3, precision_p => 7, verbose => 1, tails => 1);
349              
350             Print Vnome-test results to STDOUT. See L<dump|Statistics::Sequences/dump> in the Statistics::Sequences manpage for details. If B<verbose> => 1, then you get (1) the actual test-statistic depending on the value of B<delta> tested (I<delta^2-psi^2> for the second difference measure (default), I<delta-psi^2> for the first difference measure, and I<psi2> for the raw measure), followed by degrees-of-freedom in parentheses; and (2) a warning, if relevant, that your B<length> value might be too large with respect to the sample size (see NIST reference, above, in discussing B<length>). If B<text> => 1, you just get the average observed and expected frequencies for each v-nome, the I<Z>-value, and its associated I<p>-value.
351              
352             =cut
353              
354             sub dump {
355             my $self = shift;
356             my $args = ref $_[0] ? $_[0] : {@_};
357             $args->{'stat'} = 'vnomes';
358             $args->{'stat'} .= " ($args->{'length'})" if defined $args->{'length'};
359             $self->SUPER::dump($args);
360             if ($args->{'verbose'} && $args->{'format'} eq 'table') {
361             if ($args->{'values'}->{'psisq'}) {
362             my $delta = defined $args->{'delta'} ? $args->{'delta'} : 2;
363             my ($psisq, $df) = $self->psisq($args);
364             my $df_str = 'degree';
365             $df_str .= 's' if $df != 1;
366             if ($delta == 0) { # Raw psisq:
367             print "psisq is the Kendall-Babington Smith statistic, calculated without backward differencing, and has $df $df_str of freedom.\n";
368             }
369             elsif ($delta == 1) { # First backward difference:
370             print "psisq is Good's delta-psi^2, calculated with first backward differences, and has $df $df_str of freedom.\n";
371             }
372             else {
373             print "psisq is Good's delta^2-psi^2, calculated with second backward differences, and has $df $df_str of freedom.\n";
374             }
375             }
376             if ($args->{'values'}->{'p_value'}) {
377             print "p-value is $args->{'tails'}-tailed.\n";
378             }
379             }
380             return $self;
381             }
382              
383             =head2 nnomes
384              
385             $r = $vnomes->nnomes(length => int, states => \@ari); # minimal option required; the "v" value itself, with number of states
386             $r = $vnomes->nnomes(length => int); # minimal option required, assuming data are loaded so can read its number of states
387             $r = $vnomes->nnomes(length => int, data=> \@data); # minimal option required; the "v" value itself, with these data
388              
389             Returns the number of possible subsequences of the given length (I<v>) for the given number of states (I<t>). This is quantity denoted as I<r> in Good's (1953, 1957) papers; i.e.,
390              
391             =for html <p>&nbsp;&nbsp;<i>r[V]</i> = <i>t</i><sup><i>v</i></sup>
392              
393             The routine needs to know two things: the "v" value itself, i.e., the length of the possible subsequences to test (I<mono>nomes, I<di>nomes, I<tri>nomes, etc.), and the number of states (events, letters, etc.) that the process generating the data could take (from 1 to whatever). The former is always required to be specified by the named argument B<length>. The latter (the number of states) can be directly taken from the named argument B<nstates>, indirectly from the size of the array referenced to the named argument B<states>, from whatever states exist in the array referenced to the named argument B<data>, or from whatever states exist in data already L<load|Statistics::Sequences::Vnomes/load>ed.
394              
395             =cut
396              
397             sub nnomes {
398             my $self = shift;
399             my $args = ref $_[0] ? $_[0] : {@_};
400             my $t = _get_t($self, $args);
401             my $v = $args->{'length'} || croak __PACKAGE__, '::nnomes Must define argument \'length\' with a value greater than zero';
402             return $t**$v;
403             }
404              
405             =head2 prob_r
406              
407             $Pr = $vnomes->prob_r(length => $v); # length is 1 (mononomes) by default
408              
409             Returns the probability of the occurrence of any of the individual elements ("digits") in the sequence (I<v> = 1), or of the given B<length>, assuming they are equally likely and independent.
410              
411             =for html <p>&nbsp;&nbsp;<i>P<sub>r</sub></i> = <i>t</i><sup><i>&ndash;v</i></sup>
412              
413             =cut
414              
415             sub prob_r {
416             my $self = shift;
417             my $args = ref $_[0] ? $_[0] : {@_};
418             my $t = _get_t($self, $args);
419             my $v = $args->{'length'} || 1;
420             return $t**(-1 * $v);
421             }
422              
423             =head1 OPTIONS
424              
425             Options common to the above stats methods.
426              
427             =head2 length
428              
429             This is currently a I<required> "option", giving the length of the v-nome of interest, i.e., the value of I<v> - an integer greater than or equal to 1, and smaller than than the sample-size.
430              
431             What is a meaningful maximal value of B<length>? As a I<chi>-square test, it is conventionally required that the expected frequency is at least 5 for each v-nome (Knuth, 1988). This can be judged to be too conservative (Delucchi, 1993). The NIST documentation on the serial test (Rukhin et al., 2010) recommends that length should be less than the floored value of log2 of the sample-size, minus 2. No tests are here made of these recommendations.
432              
433             =head2 circularize
434              
435             By default, L<observed|Statistics::Sequences::Vnomes/observed, vnomes_observed, vno> and L<expected|Statistics::Sequences::Vnomes/expected, vnomes_expected, vne> counts, and the value of L<psisq|Statistics::Sequences::Vnomes/psisq>, are made by treating the sequence as a cyclic one, where the first element of the sequence follows the last one. This affects (and simplifies) the calculation of the expected frequency of each v-nome, and so the value of each psi-square. Also, circularizing ensures that the expected frequencies are accurate; otherwise, they might only be approximate. As Good and Gover (1967) state, "It is convenient to circularize in order to get exact checks of the arithmetic and also in order to simplify some of the theoretical formulae" (p. 103). These methods, however, can also treat the sequence non-cyclically by calling them with B<circularize> => 0.
436              
437             =head2 states
438              
439             Optionally send a referenced array listing the unique states (or 'events', 'letters') in the population from which the sequence was sampled, e.g., B<states> => [qw/A C G T/]. This is useful if the sequence itself might not include all the possible states. If this is not specified, the states are identified from the sequence itself. If giving a list of states, a check in each test is made to ensure that the sequence contains I<only> those elements in the list.
440              
441             =cut
442              
443             # PRIVATMETHODEN
444             sub _count_lim { # N - v + 1; if N = 5 and v = 3, starting the count-up for v-nomes must stop from the first 3 elements of the data
445             return $_[0] - $_[1] + 1;
446             }
447              
448             sub _get_n { # the size of the sequence
449             my ($self, $args) = @_;
450             my $data = ref $args->{'data'} ? $args->{'data'} : $self->read($args);
451             return scalar @$data;
452             }
453              
454             sub _get_t { # the number of unique states; e.g., 2 for a binary sequences; 10 for a typical random digit sequence (0 .. 9)
455             my ($self, $args) = @_;
456             my $t;
457             if ($args->{'nstates'}) {
458             $t = $args->{'nstates'};
459             }
460             elsif (ref $args->{'states'} and ref $args->{'states'} eq 'ARRAY') {
461             $t = scalar(@{$args->{'states'}});
462             }
463             else {
464             my $data = ref $args->{'data'} ? $args->{'data'} : $self->read($args);
465             my %hash = map { $_, 1 } @{$data};
466             my $states = [keys %hash];
467             $t = scalar(@{$states});
468             }
469             return $t;
470             }
471              
472             sub _get_stateslist {
473             my ($data, $states) = @_;
474             if (! ref $states) { # Get states from the data themselves:
475             my %hash = map { $_, 1 } @{$data};
476             $states = [keys %hash];
477             }
478             else { # Ensure that the data only contain states in the given list:
479             my ($g, $h) = ();
480             DATA:
481             foreach $g(@{$data}) {
482             foreach $h(@{$states}) {
483             next DATA if $h eq $g;
484             }
485             croak __PACKAGE__, "::test The element $g in the data is not represented in the given states";
486             }
487             }
488             #croak __PACKAGE__, '::test At least two different values must be in the sequence to test its sub-sequences' if $t <= 1;
489             return $states;
490             }
491              
492             sub _frequencies {
493             my ($data_i, $v_i, $states_aref) = @_;
494             # Get a list of all possible combinations of states at the current length ($v_i):
495             my @variations = variations_with_repetition($states_aref, $v_i);
496             # Count up the frequency of each variation in the data:
497             my $num = scalar(@{$data_i});
498             my ($i, $probe_str, $test_str, %r_freq) = ();
499             foreach (@variations) {
500             $probe_str = join'', @{$_};
501             $r_freq{$probe_str} = 0;
502             for ($i = 0; $i < _count_lim($num, $v_i); $i++) {
503             $test_str = join'', @{$data_i}[$i .. ($i + $v_i - 1)];
504             $r_freq{$probe_str}++ if $probe_str eq $test_str;
505             }
506             }# print "FREQ:\n"; while (my($key, $val) = each %freq) { print "\t$key = $val\n"; } print "\n";
507             return \%r_freq;
508             }
509            
510             sub _sel_psisq_and_df {# get psisq and its df
511             my ($v, $t, $psisq_v, $delta) = @_;
512             my ($psisq, $df) = ();
513            
514             if ($v == 1) { # psisq is asymptotically distributed chisq, can use psisq for chisq distribution:
515             $psisq = $psisq_v->{1};
516             $df = $t - 1;
517             }
518             else {
519             if ($delta == 0) { # Raw psisq:
520             $psisq = $psisq_v->{$v};
521             $df = $t**$v - 1; # Good (1957, Eq. 6) - if circularized or not
522             }
523             elsif ($delta == 1) { # First backward difference:
524             $psisq = $psisq_v->{$v} - ($v - 1 <= 0 ? 0 : $psisq_v->{$v - 1});
525             $df = $t**$v - $t**($v - 1);
526             }
527             else { # $delta == 2 # Second backward difference (default):
528             $psisq = $psisq_v->{$v} - ( 2 * ($v - 1 <= 0 ? 0 : $psisq_v->{$v - 1}) ) + ($v - 2 <= 0 ? 0 : $psisq_v->{$v - 2});
529             $df = ( $t**($v - 2) ) * ( $t - 1)**2;
530             }
531             }
532             return ($psisq, $df);
533             }
534              
535             sub _get_v_ari {
536             my $v = shift; # Init a hash to keep the psi-square values for the v, v-1, and v-2 ( = $v_i) sequence lengths, where relevant:
537             my @ari = ();
538             foreach (0 .. 2) {
539             $v > $_ ? push @ari, $v - $_ : last;
540             } # print "v ari = ", join(' ', @ari), "\n"; #push @ari, $v - 1 if $v >= 2; #push @ari, $v - 2 if $v >= 3;
541             return \@ari;
542             }
543              
544             sub _psisq_uncirc {# Compute psi^2 for uncircularized sequence from freq of each variation of length $v for given states(Good, 1953, Eq. 1):
545             my ($n, $t, $v, $r_freq) = @_;
546             my $k = ($n - $v + 1) / $t**$v;
547             my $sum = sum( map{ ($_ - $k)**2 } values %{$r_freq});#foreach (keys %{$r_freq}) {$psisq += ($r_freq->{$_} - $k)**2;}
548             return $sum / $k;
549             }
550              
551             sub _psisq_circ {
552             my ($n, $t, $v, $r_freq) = @_;# Compute psi^2 for circularized sequence from freq of each variation of length $v for given states(Good, 1953, Eq. 2):
553             my $k = $n * $t**(-1 * $v);
554             my $sum = sum( map{ ($_ - $k) } values %{$r_freq});#foreach (keys %{$r_freq}) { $sum += ($r_freq->{$_} - $k)**2;}
555             return $sum / $k;
556             }
557              
558             __END__
559              
560             =head1 EXAMPLE
561              
562             =head2 Seating at the diner
563              
564             This is the data from Swed and Eisenhart (1943) also given as an example for the L<Runs test|Statistics::Sequences::Runs/EXAMPLE> and L<Turns test|Statistics::Sequences::Turns/EXAMPLE>. It lists the occupied (O) and empty (E) seats in a row at a lunch counter.
565             Have people taken up their seats on a random basis - or do they show some social phobia, or are they trying to pick up? What does the test of Vnomes reveal?
566              
567             use Statistics::Sequences::Vnomes;
568             my $vnomes = Statistics::Sequences::Vnomes->new();
569             my @seating = (qw/E O E E O E E E O E E E O E O E/);
570             $vnomes->load(\@seating);
571             $vnomes->dump(length => 3, values => {z_value => 1, p_value => 1}, format => 'labline', flag => 1, precision_s => 3, precision_p => 3, tails => 1);
572              
573             This prints:
574              
575             z_value = 2.015, p_value = 0.022*
576              
577             That is, the observed frequency of each possible trio of seating arrangements (the trinomes OOO, OOE, OEE, EEE, etc.) differed significantly from that expected. Look up the observed frequencies for each possible trinome to see if this is because there are more empty or occupied neighbouring seats ("phobia" or "philia"):
578              
579             $vnomes->dump(length => 3, values => {observed => 1}, format => 'labline');
580              
581             This prints:
582              
583             observed = ('OEE' = 4,'EEO' = 4,'EEE' = 2,'OEO' = 1,'EOE' = 5,'OOO' = 0,'OOE' = 0,'EOO' = 0)
584              
585             As the chance-expected frequency is 2.5 (from the L<expected|Statistics::Sequences::Vnomes/expected, vnomes_expected, vne> method), there are clearly more than expected trinomes involving empty seats than occupied seats - suggesting a non-random factor like social phobia (or body odour?) is at work in sequencing people's seating here. Noting that the sequencing isn't significant for dinomes (with B<length> => 2) might also tell us something about what's going on. What happens for v-nomes of 4 or more in length? Maybe the L<runs|Statistics::Sequences::Runs> or L<pot|Statistics::Sequences::Pot> test might be a better summary of what's going on.
586              
587             =head1 REFERENCES
588              
589             Davis, J. W., & Akers, C. (1974). Randomization and tests for randomness. I<Journal of Parapsychology>, I<38>, 393-407.
590              
591             Delucchi, K. L. (1993). The use and misuse of chi-square: Lewis and Burke revisited. I<Psychological Bulletin>, I<94>, 166-176.
592              
593             Gatlin, L. L. (1979). A new measure of bias in finite sequences with applications to ESP data. I<Journal of the American Society for Psychical Research>, I<73>, 29-43. (Used for one of the reference tests in the CPAN distribution.)
594              
595             Good, I. J. (1953). The serial test for sampling numbers and other tests for randomness. I<Mathematical Proceedings of the Cambridge Philosophical Society>, I<49>, 276-284.
596              
597             Good, I. J. (1957). On the serial test for random sequences. I<Annals of Mathematical Statistics>, I<28>, 262-264.
598              
599             Good, I. J., & Gover, T. N. (1967). The generalized serial test and the binary expansion of [square-root]2. I<Journal of the Royal Statistical Society A>, I<130>, 102-107.
600              
601             Kendall, M. G., & Babington Smith, B. (1938). Randomness and random sampling numbers. I<Journal of the Royal Statistical Society>, I<101>, 147-166.
602              
603             Knuth, D. E. (1998). I<The art of computer programming> (3rd ed., Vol. 2 Seminumerical algorithms). Reading, MA, US: Addison-Wesley.
604              
605             Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., et al. (2010). A statistical test suite for random and pseudorandom number generators for cryptographic applications. Retrieved September 4 2010, from L<http://csrc.nist.gov/groups/ST/toolkit/rng/documents/SP800-22b.pdf>, and July 17, 2013, from L<http://csrc.nist.gov/publications/nistpubs/800-22-rev1a/SP800-22rev1a.pdf|SP800-22rev1a.pdf> (revised).
606              
607             =head1 SEE ALSO
608              
609             L<Statistics::Sequences|Statistics::Sequences> sub-modules for other tests of sequences, and for sharing data between these tests.
610              
611             =head1 TO DO/BUGS
612              
613             Handle non-overlapping v-nomes.
614              
615             =head1 AUTHOR/LICENSE
616              
617             =over 4
618              
619             =item Copyright (c) 2006-2013 Roderick Garton
620              
621             rgarton AT cpan DOT org
622              
623             This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see L<http://www.perl.com/perl/misc/Artistic.html>).
624              
625             =back
626              
627             =head1 DISCLAIMER
628              
629             To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.
630              
631             =head1 END
632              
633             This ends documentation of the Perl implementation of the I<psi>-square statistic, Kendall-Babington Smith test, and Good's Generalized Serial Test, for randomness in a sequence.
634              
635             =cut