File Coverage

blib/lib/Statistics/Sequences/Joins.pm
Criterion Covered Total %
statement 16 18 88.8
branch n/a
condition n/a
subroutine 6 6 100.0
pod n/a
total 22 24 91.6


line stmt bran cond sub pod time code
1             package Statistics::Sequences::Joins;
2              
3 1     1   21596 use 5.008008;
  1         3  
  1         32  
4 1     1   6 use strict;
  1         1  
  1         29  
5 1     1   6 use warnings;
  1         5  
  1         27  
6 1     1   4 use Carp qw(carp croak);
  1         1  
  1         76  
7 1     1   4 use vars qw($VERSION @ISA);
  1         7  
  1         45  
8 1     1   439 use Statistics::Sequences 0.10;
  0            
  0            
9             @ISA = qw(Statistics::Sequences);
10             use List::AllUtils qw(true uniq);
11             use Statistics::Zed 0.072;
12             our $zed = Statistics::Zed->new();
13              
14             $VERSION = '0.11';
15              
16             =pod
17              
18             =head1 NAME
19              
20             Statistics::Sequences::Joins Wishart-Hirshfeld statistics for number of alternations between two elements of a dichotomous sequence
21              
22             =head1 SYNOPSIS
23              
24             use strict;
25             use Statistics::Sequences::Joins 0.11; # methods/args here are not compatible with earlier versions
26             my $joins = Statistics::Sequences::Joins->new();
27             $joins->load(qw/1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1/); # dichotomous sequence (any values); or send as "data => $aref" with each stat call
28             my $val = $joins->observed(); # other methods include: expected(), variance(), obsdev() and stdev()
29             $val = $joins->expected(trials => 20); # by-passing need for data; also works with other methods except observed()
30             $val = $joins->z_value(tails => 1, ccorr => 1); # or want an array & get back both z- and p-value
31             $val = $joins->z_value(trials => 20, observed => 10, tails => 1, ccorr => 1); # by-pass need for data; also works with p_value()
32             $val = $joins->p_value(tails => 1); # assuming data are loaded; alias: test()
33             my $href = $joins->stats_hash(values => {observed => 1, p_value => 1}); # include any other stat-method as needed
34             $joins->dump(values => {observed => 1, expected => 1, p_value => 1}, format => 'line', flag => 1, precision_s => 3, precision_p => 7);
35             # prints: observed = 10.000, expected = 9.500, p_value = 1.0000000
36              
37             =head1 DESCRIPTION
38              
39             Joins are similar to L<runs|Statistics::Sequences::Runs> but are counted for every alternation between dichotomous events (state, element, letter ...) whereas runs are counted for each continuous segment between alternations. For example, joins are marked out with asterisks for the following sequence:
40              
41             0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0
42             * * * * * * * *
43              
44             So there's a join (of 0 and 1) at indices 1 and 2, then immediately another join (of 1 and 0) at indices 2 and 3, and then another join at 5 and 6 ... for a total count of eight joins.
45              
46             There are methods to get the observed and expected joincounts, and the expected variance in joincount. Counting up the observed number of joins needs some data to count through, but getting the expectation and variance for the joincount - if not sent actual data in the call, or already cached via L<load|load> - can just be fed with the number of trials, and, optionally, the binomial event probability (of one of the two events occurring; default = 0.50). Note that this also differs from the way runs are counted: the expected joincount, and its variance, where the relative frequencies of the two events are counted off the given data (although this option is availabe for figuring out the binomial probability here, too).
47              
48             Have non-dichotomous, continuous or multinomial data? See L<Statistics::Data::Dichotomize> for how to prepare non-dichotomous data, whether numerical or made up of categorical events, for test of joins.
49              
50             =head1 METHODS
51              
52             Methods are those described in L<Statistics::Sequences>, but can be used directly from this module, as follows.
53              
54             =head2 new
55              
56             $joins = Statistics::Sequences::Joins->new();
57              
58             Returns a new Joins object. Expects/accepts no arguments but the classname.
59              
60             =head2 load
61              
62             $joins->load(@data); # anonymously
63             $joins->load(\@data);
64             $joins->load('sample1' => \@data); # labelled whatever
65              
66             Loads data anonymously or by name - see L<load|Statistics::Data/load, load_data> in the Statistics::Data manpage for details on the various ways data can be loaded and then retrieved (more than shown here).
67              
68             After the load, the data are L<read|Statistics::Data/read, read_data, get_data> to ensure that they contain only two unique elements - if not, carp occurs and 0 rather than 1 is returned.
69              
70             Alternatively, skip this action; data don't always have to be loaded to use the stats methods here. To get the observed number of joins, data of course have to be loaded, but other stats can be got if given the observed count - otherwise, they too depend on data having been loaded.
71              
72             Every load unloads all previous loads and any additions to them.
73              
74             =cut
75              
76             sub load {
77             my $self = shift;
78             $self->SUPER::load(@_);
79             my $data = $self->read(@_);
80             my $nuniq = scalar(uniq(@{$data}));
81             if ($nuniq > 2) {
82             carp __PACKAGE__, ' More than two elements were found in the data: ' . join(' ', uniq(@$data));
83             return 0;
84             }
85             else {
86             return 1;
87             }
88             }
89              
90             =head2 add, read, unload
91              
92             See L<Statistics::Data> for these additional operations on data that have been loaded.
93              
94             =head2 observed, joincount_observed, jco
95              
96             $count = $joins->observed(); # assumes data have already been loaded
97             $count = $joins->observed(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1]); # assumes window = 1
98              
99             Returns the number of joins in a sequence - i.e., when, from trial 2 on, the event on trial I<i> doesn't equal the event on trial I<i> - 1. So the following sequence adds up to 7 joins like this:
100              
101             Sequence: 1 0 0 0 1 0 0 1 0 1 1 0
102             JoinCount: 0 1 1 1 2 3 3 4 5 6 6 7
103              
104             The data to test can already have been L<load|Statistics::Sequences/load>ed, or you send it directly keyed as C<data>.
105              
106             =cut
107              
108             sub observed {
109             my $self = shift;
110             my $args = ref $_[0] ? shift : {@_};
111             my $data = ref $args->{'data'} ? $args->{'data'} : $self->read($args);
112             my $num = scalar(@$data);
113             my ($jco, $i) = (0);
114             foreach ($i = 1; $i < $num; $i++) {
115             $jco++ if $data->[$i] ne $data->[$i - 1]; # increment count if event is not same as last
116             }
117             return $jco;
118             }
119             *joincount_observed = \&observed;
120             *jco = \&observed;
121              
122             =head2 expected, joincount_expected, jce
123              
124             $val = $joins->expected(); # assumes data already loaded, uses default prob value (.5)
125             $val = $joins->expected(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1]); # count these data, use default prob value (.5)
126             $val = $joins->expected(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1], prob => .2); # count these data, use given prob value
127             $val = $joins->expected(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1], state => 1); # count off trial numbers and prob. of event
128             $val = $joins->expected(prob => .2, trials => 10); # use this trial number and probability of one of the 2 events
129              
130             Returns the expected number of joins between every element of the given data, or for data of the given attributes, using.
131              
132             =for html <p>&nbsp;&nbsp;<i>E[J]</i> = 2(<i>N</i> &ndash; 1)<i>p</i><i>q</i>
133              
134             where I<N> is the number of observations/trials (width = 1 segments),
135              
136             I<p> is the expected probability of the joined event taking on its observed value, and
137              
138             I<q> is (1 - I<p>), the expected probability of the joined event I<not> taking on its observed value.
139              
140             The data to test can already have been L<load|Statistics::Sequences/load>ed, or you send it directly keyed as C<data>. The data are only needed to count off the number of trials, and the proportion of 1s (or other given state of the two), if the C<trials> and C<prob> attributes are not defined. If C<state> is defined, then C<prob> is worked out from the actual data (as long as there are some, or 1/2 is assumed). If C<state> is not defined, C<prob> takes the value you give to it, or, if it too is not defined, then 1/2.
141              
142             Counting up the observed number of joins needs some data to count through, but getting the expectation and variance for the joincount can just be fed with the number of C<trials>, and, optionally, the C<prob>ability of one of the two events.
143              
144             =cut
145              
146             sub expected {
147             my $self = shift;
148             my $args = ref $_[0] ? shift : {@_};
149             my ($num, $pex) = _get_trial_N($self, $args);
150             return 2 * ($num - 1) * $pex * (1 - $pex);
151             }
152             *jce = \&expected;
153             *joincount_expected = \&expected;
154              
155             =head2 variance, joincount_variance, jcv
156              
157             $val = $joins->variance(); # assume data already "loaded" for counting
158             $val = $joins->variance(data => $aref); # use inplace array reference, will use default prob of 1/2
159             $val = $joins->variance(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1], state => 1); # count off trial numbers and prob. of event
160             $val = $joins->variance(trials => number, prob => prob); # use this trial number and probability of one of the 2 events
161              
162             Returns the expected variance in the number of joins for the given data.
163              
164             =for html <p>&nbsp;&nbsp;<i>V[J]</i> = 4<i>N</i><i>p</i><i>q</i>(1 &ndash; 3<i>p</i><i>q</i>) &ndash; 2<i>p</i><i>q</i>(3 &ndash; 10<i>p</i><i>q</i>)
165              
166             defined as above for L<joincount_expected|Statistics::Sequences::Joins/expected, joincount_expected, jce>.
167              
168             The data to test can already have been L<load|Statistics::Sequences/load>ed, or you send it directly keyed as C<data>. The data are only needed to count off the number of trials, and the proportion of 1s (or other given state of the two), if the C<trials> and C<prob> attributes aren't defined. If C<state> is defined, then C<prob> is worked out from the actual data (as long as there are some, or expect a C<croak>). If C<state> is not defined, C<prob> takes the value you give to it, or, if it too is not defined, then 1/2.
169              
170             =cut
171              
172             sub variance {
173             my $self = shift;
174             my $args = ref $_[0] ? shift : {@_};
175             my ($num, $pex) = _get_trial_N($self, $args);
176             my $pq = $pex * (1 - $pex);
177             return ( 4 * $num * $pq ) * (1 - ( 3 * $pq ) ) - ( ( 2 * $pq ) * (3 - ( 10 * $pq ) ) );
178             }
179             *jcv = \&variance;
180             *joincount_variance = \&variance;
181              
182             =head2 obsdev, observed_deviation
183              
184             $v = $joins->obsdev(); # use data already loaded - anonymously; or specify its "label" or "index" - see observed()
185             $v = $joins->obsdev(data => [qw/blah bing blah blah blah/]); # use these data
186              
187             Returns the deviation of (difference between) observed and expected joins for the loaded/given sequence (I<O> - I<E>).
188              
189             =cut
190              
191             sub obsdev {
192             return observed(@_) - expected(@_);
193             }
194             *observed_deviation = \&obsdev;
195              
196             =head2 stdev, standard_deviation
197              
198             $v = $joins->stdev(); # use data already loaded - anonymously; or specify its "label" or "index" - see observed()
199             $v = $joins->stdev(data => [qw/blah bing blah blah blah/]);
200              
201             Returns square-root of the variance.
202              
203             =cut
204              
205             sub stdev {
206             return sqrt(variance(@_));
207             }
208             *standard_deviation = \&stdev;
209              
210             =head2 z_value, joincount_zscore, jzs, zscore
211              
212             $val = $joins->z_value(); # data already loaded, use default windows and prob
213             $val = $joins->z_value(data => $aref, prob => .5, ccorr => 1);
214             ($zvalue, $pvalue) = $joins->z_value(data => $aref, prob => .5, ccorr => 1, tails => 2); # same but wanting an array, get the p-value too
215              
216             Returns the zscore from a test of joincount deviation, taking the joincount expected away from that observed and dividing by the root expected joincount variance, by default with a continuity correction to expectation. Called wanting an array, returns the z-value with its p-value for the tails (1 or 2) given.
217              
218             The data to test can already have been L<load|Statistics::Sequences/load>ed, or you send it directly keyed as C<data>.
219              
220             Other options are C<precision_s> (for the z_value) and C<precision_p> (for the p_value).
221              
222             =cut
223              
224             sub z_value {
225             my $self = shift;
226             my $args = ref $_[0] ? shift : {@_};
227             my $data = ref $args->{'data'} ? $args->{'data'} : $self->read($args);
228             my $jco = defined $args->{'observed'} ? $args->{'observed'} : $self->jco($args);
229             my $pex = defined $args->{'prob'} ? $args->{'prob'} : .5;
230             my $num = defined $args->{'trials'} ? $args->{'trials'} : scalar(@{$data});
231             my $ccorr = defined $args->{'ccorr'} ? $args->{'ccorr'} : 1;
232             my $tails = $args->{'tails'} || 2;
233             my ($zval, $pval) = $zed->zscore(
234             observed => $jco,
235             expected => $self->jce(prob => $pex, trials => $num),
236             variance => $self->jcv(prob => $pex, trials => $num),
237             ccorr => $ccorr,
238             tails => $tails,
239             precision_s => $args->{'precision_s'},
240             precision_p => $args->{'precision_p'},
241             );
242             return wantarray ? ($zval, $pval) : $zval;
243             }
244             *jzs = \&z_value;
245             *joincount_zscore = \&z_value;
246             *zscore = \&z_value;
247              
248             =head2 p_value, test, joins_test, jct
249              
250             $p = $joins->p_value(); # using loaded data and default args
251             $p = $joins->p_value(ccorr => 0|1, tails => 1|2); # normal-approximation based on loaded data
252             $p = $joins->p_value(data => [1, 0, 1, 1, 0], exact => 1); # using given data (by-passing load and read)
253             $p = $joins->p_value(trials => 20, observed => 10); # without using data, specifying its size and join-count
254              
255             Test data for significance of the number of joins by deviation ratio (obsdev / stdev). Returns the Joins object, lumped with a C<z_value>, C<p_value>, and the descriptives C<observed>, C<expected> and C<variance>. Data are those already L<load|Statistics::Sequences/load>ed, or directly keyed as C<data>
256              
257             =cut
258              
259             sub p_value {
260             return (z_value(@_))[1];
261             }
262             *test = \&p_value;
263             *joins_test = \&p_value;
264             *jct = \&p_value;
265              
266             =head2 stats_hash
267              
268             $href = $joins->stats_hash(values => {observed => 1, expected => 1, variance => 1, z_value => 1, p_value => 1}, prob => .5, ccorr => 1);
269              
270             Returns a hashref for the counts and stats as specified in its "values" argument, and with any options for calculating them (e.g., exact for p_value). See L<Statistics::Sequences/stats_hash> for details. If calling via a "joins" object, the option "stat => 'joins'" is not needed (unlike when using the parent "sequences" object).
271              
272             =head2 dump
273              
274             $joins->dump(values => { observed => 1, variance => 1, p_value => 1}, exact => 1, flag => 1, precision_s => 3); # among other options
275              
276             Print Joins-test results to STDOUT. See L<dump|Statistics::Sequences/dump> in the Statistics::Sequences manpage for details.
277              
278             =cut
279              
280             sub dump {
281             my $self = shift;
282             my $args = ref $_[0] ? $_[0] : {@_};
283             $args->{'stat'} = 'joins';
284             $self->SUPER::dump($args);
285             }
286              
287             sub _get_trial_N {
288             my ($self, $args, $n, $data) = @_;
289             if (defined $args->{'trials'}) {
290             $n = $args->{'trials'};
291             }
292             else {
293             $data = ref $args->{'data'} ? $args->{'data'} : $self->read($args);
294             $n = scalar(@$data);
295             }
296             my $p = defined $args->{'prob'} ? $args->{'prob'} : (defined $args->{'state'} and defined $data) ? _count_pfrq($self, $data, $args->{'state'}) : .5;
297             return ($n, $p);
298             }
299              
300             sub _count_pfrq {
301             my ($self, $aref, $state, $count) = @_;
302             return .5 if ! ref $aref or ! scalar(@$aref);
303             $count++ if true { $_ eq $state } @{$aref};
304             return $count / scalar(@{$aref});
305             }
306              
307             1;
308              
309             __END__
310              
311             =head1 EXAMPLE
312              
313             Here the problem is to assess the degree of consistency of in number of matches between target and response obtained in each of 200 runs of 25 trials each. The number of matches expected on the basis of chance is 5 per run. To test for sustained high or low scoring sequences, a join is defined as the point at which a score on one side of this value (4, 3, 2, etc.) is followed by a score on the other side (6, 7, 8, etc.). Ignoring scores equalling the expectation value (5), the probability of a join is 1/2, or 0.5 (the default value to L<test|test>), assuming that, say, a score of 4 is as likely as a score of 6, and anything greater than a deviation of I<5> (from 5) is improbable/impossible.
314              
315             use Statistics::Sequences;
316              
317             # Conduct pseudo identification 5 x 5 runs:
318             my ($i, $hits, $stimulus, $response, @scores);
319             foreach ($i = 0; $i < 200; $i++) {
320             $scores[$i] = 0;
321             for (0 .. 24) {
322             $stimulus = (qw/circ plus rect star wave/)[int(rand(5))];
323             $response = (qw/circ plus rect star wave/)[int(rand(5))];
324             $scores[$i]++ if $stimulus eq $response;
325             }
326             }
327              
328             my $seq = Statistics::Sequences->new();
329             $seq->load(@scores);
330             $seq->cut(value => 5, equal => 0); # value is the expected number of matches (Np); ignoring values equal to this
331             $seq->test(stat => 'joins', tails => 1, ccorr => 1)->dump(text => 1, flag => 1);
332             # prints, e.g., Joins: expected = 79.00, observed = 67.00, Z = -1.91, 1p = 0.028109*
333              
334             =head1 REFERENCES
335              
336             Wishart, J. & Hirshfeld, H. O. (1936). A theorem concerning the distribution of joins between line segments. I<Journal of the London Mathematical Society>, I<11>, 227.
337              
338             =head1 SEE ALSO
339              
340             L<Statistics::Sequences::Runs|lib::Statistics::Sequences::Runs> : Analogous test.
341              
342             L<Statistics::Sequences::Pot|lib::Statistics::Sequences::Pot> : Another concept of sequential structure.
343              
344             =head1 BUGS/LIMITATIONS
345              
346             No computational bugs as yet identfied. Hopefully this will change, given time.
347              
348             =head1 REVISION HISTORY
349              
350             See CHANGES in installation dist for revisions.
351              
352             =head1 AUTHOR/LICENSE
353              
354             =over 4
355              
356             =item Copyright (c) 2006-2013 Roderick Garton
357              
358             rgarton AT cpan DOT org
359              
360             This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see L<http://www.perl.com/perl/misc/Artistic.html>).
361              
362             =item Disclaimer
363              
364             To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.
365              
366             =back
367              
368             =cut