File Coverage

blib/lib/DBSchema/Normalizer.pm

Criterion    Covered   Total       %
statement         10      12    83.3
branch                          n/a
condition                       n/a
subroutine         4       4   100.0
pod                             n/a
total             14      16    87.5


line stmt bran cond sub pod time code
1             ############################################################
2             #
3             # DBSchema::Normalizer - a MySQL database table normalizer
4             #
5             # Copyright (c) 2001 by Giuseppe Maxia
6             # Produced under the GPL (Golden Perl Laziness)
7             # Distributed under the GPL (GNU General Public License)
8             #
9             ############################################################
10              
11             require 5.004;
12              
13             =head1 NAME
14              
15             DBSchema::Normalizer - database normalization - convert a table from 1st to 2nd normal form
16              
17             =head1 SYNOPSIS
18              
19             # the easy way is to give all parameters to the constructor
20             # and then call do()
21             #
22             use DBSchema::Normalizer;
23             my $norm = DBSchema::Normalizer->new (
24             {
25             DSN => $DSN,
26             username => $username,
27             password => $password,
28             src_table => $sourcetable,
29             index_field => $indexfield,
30             lookup_fields => $lookupfields, # comma separated list
31             lookup_table => $lookuptable,
32             dest_table => $dest_table,
33             copy_indexes => "yes",
34             });
35             $norm->do(); # Just Do It!
36              
37             # Alternatively, you can have some more control, by
38             # creating the lookup table and normalized table separately,
39             # especially useful if one of them is an intermediate step.
40             #
41             use DBSchema::Normalizer qw(create_lookup_table create_normalized_table);
42             my $norm = DBSchema::Normalizer->new(
43             {
44             DSN => $DSN,
45             username => $username,
46             password => $password
47             });
48             $norm->create_lookup_table (
49             {
50             src_table => $tablename,
51             index_field => $indexfield,
52             lookup_fields => $lookupfields,
53             lookup_table => $lookuptable
54             });
55             $norm->create_normalized_table (
56             {
57             src_table => $tablename,
58             index_field => $indexfield,
59             lookup_fields => $lookupfields,
60             lookup_table => $lookuptable,
61             dest_table => $dest_table,
62             copy_indexes => "yes",
63             });
64              
65             =head1 DESCRIPTION
66              
67             B<DBSchema::Normalizer> is a module that helps transform MySQL database tables from 1st to 2nd normal form.
68             Simply put, it will create a lookup table out of a set of repeating fields from a source table, and replace such fields by a foreign key that points to the corresponding fields in the newly created table.
69             All information is taken from the database itself. There is no need to specify the existing table structure.
70             The module is capable of re-creating existing indexes, and should deal with complex cases where the replaced fields are part of a primary key.
71              
72             =head2 Algorithm
73              
74             The concept behind B<DBSchema::Normalizer> is based upon some DBMS properties. To replace repeating fields with a foreign key pointing to a lookup table, you must be sure that for each distinct set of values you have a distinct foreign key. You might be tempted to solve the problem with something like this:
75              
76             I. Read all records into memory
77             II. For each record, identify the unique value for the fields to be
78             moved into a lookup table and store it in a hash
79             III. Again, for each record, find the corresponding value in the
80             previously saved hash and save the non-lookup fields plus the
81             unique key into a new table
82             IV. For each key in the hash, write the values to a lookup table
83              
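For illustration, here is how such a naive approach might look in Perl/DBI. This is only a sketch of the approach being criticized, not part of the module; the connection parameters, the MP3/songs/tmp_albums tables and the lookup fields are made-up examples, and $user and $password are assumed to be defined:

    use DBI;
    my $dbh = DBI->connect("DBI:mysql:music", $user, $password,
        { RaiseError => 1 });
    my %lookup;      # distinct (artist, album, genre) => new key
    my $next_id = 0;
    my $sth = $dbh->prepare(
        "SELECT ID, title, artist, album, genre FROM MP3");
    $sth->execute();
    while (my ($id, $title, @lkp) = $sth->fetchrow_array()) {
        my $key = join("\0", @lkp);
        $lookup{$key} = ++$next_id unless exists $lookup{$key};
        $dbh->do("INSERT INTO songs VALUES (?, ?, ?)",
                 undef, $id, $title, $lookup{$key});
    }
    # finally, dump the hash into the lookup table
    while (my ($key, $id) = each %lookup) {
        $dbh->do("INSERT INTO tmp_albums VALUES (?, ?, ?, ?)",
                 undef, $id, split "\0", $key);
    }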
84             I can find four objections against such an attempt:
85              
86             1. Memory problems. The source table can be very large (and some of the tables I had to normalize were indeed huge; this kind of solution would have crashed any program trying to load them into memory). Instead of reading the source table into memory, we could just read the records twice from the database and deal with them one at a time. However, even the size of the hash could prove to be too much for our computer memory. A hash of 2_000_000 items is unlikely to be handled efficiently on most of today's desktops.
87              
88             2. Data specific solution. To implement this algorithm, we need to include details specific to our particular records in our code. It is not a good first step toward re-utilization.
89              
90             3. Data conversion. We need to fetch data from the database, possibly transform it into suitable formats for our calculation, and then send it back, re-packed in database format. Not always an issue, but think about the handling of floating point fields and timestamp fields with reduced view. Everything can be solved, but it could be a heavy overhead for your sub.
91              
92             4. Finally, I would say that this kind of task is not your job. Nor is Perl's. It belongs in the database engine, which can easily, within its boundaries, identify unique values and make a lookup table out of them. And it can also easily make a join between source and lookup table.
93              
94             That said, the solution springs to mind. Let the database engine do its job, and have Perl drive it towards the solution we need to achieve. The algorithm is based upon the fact that a table created from a SELECT DISTINCT statement is guaranteed to have a direct relationship with each record of the source table, when compared using the same elements we considered in the SELECT DISTINCT.
95              
96             The algorithm takes four steps:
97              
98             I. create the lookup table
99              
100             CREATE TABLE lookup ({lookupID} INT NOT NULL auto_increment
101             primary key, {LOOKUP FIELDS});
102              
103             #(and add a key for each {LOOKUP FIELDS})
104              
105             II. fill in the lookup table
106            
107             INSERT INTO lookup
108             SELECT DISTINCT NULL, {LOOKUP FIELDS} FROM source_table;
109             #(the {lookupID} is automatically created, being auto_increment)
110              
111             III. create the normalized table
112              
113             CREATE TABLE norm_table ({source_table FIELDS} -
114             {LOOKUP FIELDS} + {lookupID})
115              
116             IV. fill in the normalized table
117              
118             INSERT INTO norm_table
119             SELECT {source_table FIELDS} - {LOOKUP FIELDS} + {lookupID}
120             FROM source_table
121             INNER JOIN lookup
122             on (source_table.{LOOKUP FIELDS}= lookup.{LOOKUP FIELDS})
123              
124             As you can see, the entire operation is run in the server workspace, thus avoiding problems of (a) fetching records (less network traffic), (b) handling data conversion, (c) memory occupation and (d) efficiency.
125              
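In Perl/DBI terms, the four steps boil down to a handful of do() calls. The sketch below hard-codes the field lists that the module actually extracts from the database; the table, column and connection names are only examples:

    use DBI;
    my $dbh = DBI->connect("DBI:mysql:music", $user, $password,
        { RaiseError => 1 });

    # I. create the lookup table
    $dbh->do(qq{CREATE TABLE lookup (album_id INT NOT NULL AUTO_INCREMENT
        PRIMARY KEY, artist VARCHAR(20), album VARCHAR(30),
        genre VARCHAR(10))});

    # II. fill it with the distinct combinations of the lookup fields
    #     (NULL lets AUTO_INCREMENT generate album_id)
    $dbh->do(qq{INSERT INTO lookup
        SELECT DISTINCT NULL, artist, album, genre FROM MP3});

    # III. create the normalized table
    $dbh->do(qq{CREATE TABLE norm_table (ID INT, title VARCHAR(40),
        duration TIME, size INT, album_id INT)});

    # IV. fill it through a join on the lookup fields
    $dbh->do(qq{INSERT INTO norm_table
        SELECT src.ID, src.title, src.duration, src.size, lkp.album_id
        FROM MP3 src INNER JOIN lookup lkp
        ON (src.artist = lkp.artist AND src.album = lkp.album
            AND src.genre = lkp.genre)});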
126             Let's see an example.
127              
128             Having a table MP3 with these fields
129              
130             mysql> describe MP3;
131             +----------+-------------+------+-----+----------+----------------+
132             | Field | Type | Null | Key | Default | Extra |
133             +----------+-------------+------+-----+----------+----------------+
134             | ID | int(11) | | PRI | NULL | auto_increment |
135             | title | varchar(40) | | MUL | | |
136             | artist | varchar(20) | | MUL | | |
137             | album | varchar(30) | | MUL | | |
138             | duration | time | | | 00:00:00 | |
139             | size | int(11) | | | 0 | |
140             | genre | varchar(10) | | MUL | | |
141             +----------+-------------+------+-----+----------+----------------+
142             7 rows in set (0.00 sec)
143              
144             We want to produce two tables, the first one having only [ID, title, duration, size], while the second one should get [artist, album, genre]. (The second one will also need to be further split into [artist] and [album, genre], but we can deal with that later.)
145              
146             Here are the instructions to normalize this table:
147              
148             DROP TABLE IF EXISTS tmp_albums;
149             CREATE TABLE tmp_albums (album_id INT NOT NULL AUTO_INCREMENT
150             PRIMARY KEY,
151             artist varchar(20) not null,
152             album varchar(30) not null,
153             genre varchar(10) not null,
154             KEY artist (artist), KEY album (album), KEY genre (genre));
155              
156             INSERT INTO tmp_albums
157             SELECT DISTINCT NULL, artist,album,genre FROM MP3;
158              
159             DROP TABLE IF EXISTS songs;
160             CREATE TABLE songs (ID int(11) not null auto_increment,
161             title varchar(40) not null,
162             duration time not null default '00:00:00',
163             size int(11) not null,
164             album_id INT(11) NOT NULL,
165             PRIMARY KEY (ID), KEY title (title), KEY album_id (album_id));
166            
167             INSERT INTO songs SELECT src.ID, src.title, src.duration,
168             src.size, album_id
169             FROM MP3 src INNER JOIN tmp_albums lkp
170             ON (src.artist =lkp.artist and src.album =lkp.album
171             and src.genre =lkp.genre);
172              
173             We can then use the same technique to normalize the albums into a proper table.
174              
175             DROP TABLE IF EXISTS artists;
176             CREATE TABLE artists (artist_id INT NOT NULL AUTO_INCREMENT
177             PRIMARY KEY,
178             artist varchar(20) not null,
179             KEY artist (artist)) ;
180            
181             INSERT INTO artists
182             SELECT DISTINCT NULL, artist FROM tmp_albums;
183              
184             DROP TABLE IF EXISTS albums;
185            
186             CREATE TABLE albums (album_id int(11) not null auto_increment,
187             album varchar(30) not null,
188             genre varchar(10) not null,
189             artist_id INT(11) NOT NULL,
190             PRIMARY KEY (album_id),
191             KEY genre (genre), KEY album (album), KEY artist_id (artist_id));
192            
193             INSERT INTO albums
194             SELECT src.album_id, src.album, src.genre, artist_id
195             FROM tmp_albums src
196             INNER JOIN artists lkp ON (src.artist =lkp.artist);
197              
198             mysql> describe artists;
199             +-----------+-------------+------+-----+---------+----------------+
200             | Field | Type | Null | Key | Default | Extra |
201             +-----------+-------------+------+-----+---------+----------------+
202             | artist_id | int(11) | | PRI | NULL | auto_increment |
203             | artist | varchar(20) | | MUL | | |
204             +-----------+-------------+------+-----+---------+----------------+
205             2 rows in set (0.00 sec)
206            
207             mysql> describe albums;
208             +-----------+-------------+------+-----+---------+----------------+
209             | Field | Type | Null | Key | Default | Extra |
210             +-----------+-------------+------+-----+---------+----------------+
211             | album_id | int(11) | | PRI | NULL | auto_increment |
212             | album | varchar(30) | | MUL | | |
213             | genre | varchar(10) | | MUL | | |
214             | artist_id | int(11) | | MUL | 0 | |
215             +-----------+-------------+------+-----+---------+----------------+
216             4 rows in set (0.00 sec)
217            
218             mysql> describe songs;
219             +----------+-------------+------+-----+----------+----------------+
220             | Field | Type | Null | Key | Default | Extra |
221             +----------+-------------+------+-----+----------+----------------+
222             | ID | int(11) | | PRI | NULL | auto_increment |
223             | title | varchar(40) | | MUL | | |
224             | duration | time | | | 00:00:00 | |
225             | size | int(11) | | | 0 | |
226             | album_id | int(11) | | MUL | 0 | |
227             +----------+-------------+------+-----+----------+----------------+
228             5 rows in set (0.00 sec)
229            
230             It should be clear now WHAT we have to do. Less clear is HOW. The above instructions seem to imply that we manually copy the field structure from the source table to the lookup and normalized tables.
231              
232             Actually, that SQL code (except the DESCRIBEs) was produced by this very module and printed to STDOUT, so that all I had to do was some cut-and-paste.
233             And then we are back to the question of the algorithm. If this is all SQL, where is Perl involved?
234             The answer is that Perl will reduce the amount of information we need to give to the database engine.
235             The information about the field structure and indexes is already in the database. Our Perl module (with a [not so] little help from the DBI) can extract the structure from the database and create the appropriate SQL statements.
236             On the practical side, this means that, before producing SQL code, this module will gather information about the source table. It will issue "SHOW FIELDS FROM tablename" and "SHOW INDEX FROM tablename" statements, and parse their results to prepare the operational code.
237              
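As a sketch of how such metadata is available through the DBI (SHOW FIELDS returns the columns Field, Type, Null, Key, Default and Extra; table and variable names here are examples):

    use DBI;
    my $dbh = DBI->connect("DBI:mysql:music", $user, $password,
        { RaiseError => 1 });

    # collect name and type of every column of the source table
    my %fields;
    my $sth = $dbh->prepare("SHOW FIELDS FROM MP3");
    $sth->execute();
    while (my $row = $sth->fetchrow_hashref()) {
        $fields{ $row->{Field} } = $row->{Type};
    }

    # building a CREATE TABLE for a subset of columns is then
    # a matter of string assembly
    my @lookup_cols = qw(artist album genre);
    my $create = "CREATE TABLE tmp_albums (album_id INT NOT NULL "
               . "AUTO_INCREMENT PRIMARY KEY, "
               . join(", ", map { "$_ $fields{$_} NOT NULL" } @lookup_cols)
               . ")";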
238             That's it. It seems a rather small contribution to your average workload, but if you ever have to deal with a project involving several large tables, with many fields, to be transformed into many normalized tables, I am sure you will appreciate the GPL (Golden Perl Laziness) behind this module.
239              
240             BTW, this is the code used to produce the above SQL statements:
241              
242             #!/usr/bin/perl -w
243             use strict;
244              
245             use DBSchema::Normalizer;
246              
247             my $norm = DBSchema::Normalizer->new ({
248             DSN => "DBI:mysql:music;host=localhost;"
249             . "mysql_read_default_file=$ENV{HOME}/.my.cnf",
250             src_table => "MP3",
251             index_field => "album_id",
252             lookup_fields => "artist,album,genre",
253             lookup_table => "tmp_albums",
254             dest_table => "songs",
255             copy_indexes => 1,
256             simulate => 1
257             });
258              
259             $norm->do();
260              
261             $norm->create_lookup_table ({
262             src_table => "tmp_albums",
263             index_field => "artist_id",
264             lookup_fields => "artist",
265             lookup_table => "artists"
266             });
267              
268             $norm->create_normalized_table ({
269             src_table => "tmp_albums",
270             lookup_table => "artists",
271             index_field => "artist_id",
272             lookup_fields => "artist",
273             dest_table => "albums"
274             });
275              
276             Twenty-five lines of code. Not bad for such a complicated task. But even that could have been replaced by these two one-liners:
277              
278             perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music MP3 \
279             album_id album,artist,genre tmp_albums songs 1 1 1))->do()'
280            
281             perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music \
282             tmp_albums artist_id artist artists albums 1 1 1))->do()'
283              
284             (See below item "snew" for more details.)
285            
286             One thing that this module won't do for you, though, is to decide which columns should stay with the source table and which ones should go to the lookup table. This is something for which you need to apply some database theory, and I don't expect you to know it unless you have studied it either at school or independently (unless you happen to be E.F. Codd).
287             I am planning (in a very idle and distant way) another module that will analyze a database table and decide if it is a good design or not. The examples from commercial software I have seen so far did not impress me a lot. I am still convinced that humans are better than machines at this task. But, hey! Ten years ago I was convinced that humans were much better than machines at chess, and instead, not long ago, I had to see an IBM box doing very nasty things to Garry Kasparov. So maybe I'll change my mind. In the meantime, I am enjoying my present intellectual superiority and I keep analyzing databases with the same pleasure that I once felt when solving a chess problem.
288              
289             =head2 Simulation mode
290              
291             This module can do the data transfer for you, or you may want to run it in "simulation mode", by adding simulate => "1" to the constructor parameters. When in simulation mode, the Normalizer will just print the necessary SQL statements to STDOUT, without passing them to the database engine. You can thus check the queries, possibly change them, and use them within some other application.
292              
293             =head2 EXPORT
294              
295             new, snew, create_lookup_table, create_normalized_table
296              
297             =head2 DEPENDENCIES
298              
299             DBI, DBD::mysql
300              
301             =head2 Architecture
302              
303             The Normalizer doesn't enforce private data protection. You are only supposed to call the methods which are documented here as public. In the spirit of Perl OO philosophy, nobody will prevent you from calling private methods (the ones beginning with "_") or fiddling with internal hash fields. However, be aware that such behaviour is highly reprehensible, could lead to unpredictable side effects, of which B<you> are entirely, utterly and irremediably responsible (not to mention that your reputation will be perpetually tarnished, your nephews will call you "Cheating Joe" and you won't ever - EVER - be invited as a dinner speaker to any Perl OO conference, and even if you manage to sneak in you will find a hair in your soup.)
304              
305             =head2 PORTABILITY
306              
307             The algorithm used here is general. It was initially developed in C for an embedded RDBMS and there is no reason to assume that it won't work with any other database engine. However, the information about field types and indexes is, in this module, MySQL dependent. At the moment, I haven't found a clean method to get such information in a database-independent way.
308             To adapt this module to a different database, SQL statements corresponding to the MySQL-specific SHOW INDEX and SHOW FIELDS should be provided. Also, the syntax for INNER JOIN might not be portable across databases.
309              
310             =head2 CAVEAT
311              
312             As always, when dealing with databases, some caution is needed.
313             The create_lookup_table() method will B<drop the lookup table>, if it exists. Be careful about the name you supply for this purpose. If you want to use an existing lookup table (whose data integrity and relational correspondence you can swear upon), then skip create_lookup_table() and ask only for create_normalized_table(). For this one also, a B<DROP TABLE> statement is issued before the creation.
314             Exercise double care on the names you pass to the module.
315              
316             Be also aware that the tables created by this module are of the default table type. You may either choose to convert them after the data transfer or run the Normalizer in "simulation mode" and then manually modify the SQL statements.
317              
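For example, to turn the newly created tables into transactional ones after the transfer (MySQL 3.23/4.0 syntax; the table names are those of the example above):

    ALTER TABLE tmp_albums TYPE=InnoDB;
    ALTER TABLE songs TYPE=InnoDB;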
318             The Normalizer will usually warn you (and exit with flags and bells) if one or more designated lookup fields in the source table are not indexed. This could result in VERY SLOW PERFORMANCE, even for a reasonably low number of records involved.
319             You can choose to ignore this warning, by setting the appropriate parameter, but it is not advisable.
320              
321             If the source table does not have a primary key (don't laugh, I have seen some of them) then a fatal error is issued, without any possible remedy. The reason is simple: if there is no primary key, the engine has no way of identifying which rows should go in a JOIN, and your result may contain duplicates (and in addition you will be waiting a long time to get it.)
322              
323             =head2 TO DO
324              
325             1. Parametrizing the statements for fields and indexes information should improve the chances for
326             portability.
327              
328             # e.g.: MySQL index information comes in this flavor
329             $mysql_index_info = {
330             function_name => "SHOW INDEX FROM $table",
331             info_names => "Table,Non_unique,Key_name,Index_seq,Col_name" #etc
332             };
333             # but it is hard to generalize, especially if another database
334             # engine defines
335             # as positive (Unique) what here is negative (Non_unique)
336              
337             Maybe a more effective way would be to have a polymorphic version of DBSchema::Normalizer, with the base class calling abstract subs, which the descendant classes are supposed to implement.
338             Sounds interesting, even though I feel that I might have some clashes with the DBI.
339              
340             2. Adding support for intermediate steps in converting should also speed up some ugly cases with nested normalization problems, when the lookup table needs to be further normalized.
341              
342             3. Adding support for conversion from Zero normal form to First is not straightforward. Some cases are easy to identify and deal with (e.g. columns Student1, Student2, Student3, StudentN can be converted to a column Student_id pointing at a Students table), but others are more subtle and difficult to generalize (e.g. having two columns for Male and Female, with yes/no content).
343              
344             =head2 DISCLAIMER
345              
346             This software can alter data in your database. If used improperly, it can also damage existing data.
347             (And so can almost any powerful software on your machine, such as Perl itself. Sorry to scare you, but I have to warn users about potential misuse.)
348             There is B<NO WARRANTY> of any sort on this software. This software is provided "AS IS".
349             Please refer to the GPL, GNU General Public License, Version 2, paragraphs 11 and 12, for more details.
350              
351             =head1 SEE ALSO
352              
353             DBI, DBD::mysql
354              
355             =cut
356              
357             package DBSchema::Normalizer;
358 1     1   904 use strict;
  1         2  
  1         34  
359 1     1   5 use warnings;
  1         2  
  1         35  
360              
361 1     1   2276 use DBI;
  1         19825  
  1         58  
362 1     1   1385 use DBD::mysql; # This version is MySQL dependent. It could be changed later
  0            
  0            
363             use Carp;
364             require Exporter;
365              
366             our @ISA = qw(Exporter);
367             our @EXPORT = qw(new snew do );
368             our @EXPORT_OK = qw(create_lookup_table create_normalized_table);
369              
370             our $VERSION = '0.08'; # 09-Nov-2002
371              
372             my @_accepted_parameters = qw(dbh DSN username password src_table index_field
373             lookup_fields lookup_table dest_table copy_indexes verbose
374             simulate ignore_warning);
375              
376             my %_accepted_params = map {$_, 1} @_accepted_parameters;
377              
378             =head1 Class methods
379              
380             =over
381              
382             =item new
383              
384             DBSchema::Normalizer->new ({
385             DSN => $DSN,
386             username => $username,
387             password => $password
388             });
389              
390             new() - object constructor. Requires a hash reference with at least the following keys
391              
392             DSN
393             username
394             password
395              
396             Alternatively, you may pass one already initialized database handler
397              
398             dbh => $dbh
399              
400             Optional fields (in the sense that if you omit them here, you must declare them when calling I<create_lookup_table()> or I<create_normalized_table()>)
401              
402             src_table The table in 1st normal form
403             index_field the index field that we need to create
404             will become foreign key in the source table
405             and primary key in the lookup table
406             lookup_fields the fields depending on the index,
407             in a comma-separated list
408             lookup_table the lookup table
409             dest_table the Normalized (2nd form) table
410              
411             Really optional fields. You may not mention them at all. Their default is 0.
412              
413             copy_indexes three values:
414             "no" or "0" : no indexes are copied
415             "yes" or "1" : indexes from the source table will
416             be immediately replicated to the
417             destination table
418             "later" or "2" : indexes will be created after the
419             data transfer,
420             as an ALTER TABLE statement.
421             It may speed up the insertion
422             for large tables.
423             verbose if "1", messages indicating what is going on
424             will be sent to STDOUT.
425             Using "2", even more verbose information is
426             given (all queries printed before execution);
427             Level "3" will also show details about src_table
428             fields and indexes;
429             ignore_warning if "1", warning on missing indexes on lookup fields
430             are ignored, and the requested operation carried
431             out even at a price of long waiting. Default "0"
432             simulate if "1", no operation is carried out
433             but the queries are printed to STDOUT (as in
434             verbose => 2)
435              
436             B<Note>: src_table, dest_table and lookup_table B<cannot> be called I<src> or I<lkp>, which are names used internally by the Normalizer. If such names are given, a fatal error is issued.
437              
438             If the keys for src_table, index_field, lookup_table and lookup_fields are missing, they can (actually they MUST) be later provided by calls to create_lookup_table() and create_normalized_table().
439              
440             =cut
441              
442             sub new {
443             my $class = shift;
444             my $params = shift;
445             my $_dbh = undef;
446             if (exists $params->{dbh} && defined $params->{dbh}) {
447             $_dbh = $params->{dbh}
448             }
449             else {
450             return undef unless defined $params->{DSN};
451             $_dbh= DBI->connect($params->{DSN}, $params->{username},
452             $params->{password}, { RaiseError => 1});
453             }
454             my $self = bless {
455             verbose => 0,
456             copy_indexes => 0,
457             simulate => 0,
458             ignore_warning => 0,
459             _dbh => $_dbh # Being an object, $_dbh is already
460             # a reference. Doesn't need the "\"
461             # before it.
462             }, $class;
463             foreach my $key (keys (%$params)) {
464             croak "invalid parameter $key \n" unless exists $_accepted_params{$key};
465             $self->{$key} = $params->{$key};
466             }
467             if ($self->{simulate} eq "1") {
468             $self->{verbose} = "2";
469             }
470             elsif ($self->{simulate} ne "0") {
471             croak "invalid value for simulate\n";
472             }
473             return ($self);
474             }
475              
476             =item snew
477              
478             snew is a shortcut for new(). It is called with parameters passed by position instead of using a hash reference.
479             It is a "quick-and-dirty" ("dirty" being the operational word) method intended for the impatient who do not want to write a script.
480             B<Use it at your own risk>.
481             Parameters are passed in this order:
482              
483             host
484             database
485             source table
486             index field
487             lookup fields
488             lookup table
489             destination table
490             copy indexes
491             verbose
492             simulate
493            
494             Here is an example of one-liner normalization call:
495            
496             perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music MP3 \
497             album_id album,artist,genre tmp_albums songs 1 1 1))->do()'
498              
499             Note: ALL 11 parameters must be passed, or a "use of uninitialized value" error is issued.
500              
501             This one-liner is equivalent to the following script:
502            
503             #!/usr/bin/perl
504             no warnings; # Yeah. No warnings. I said it is equivalent,
505             # not recommended.
506             no strict; # Yup. No strict either.
507             use DBSchema::Normalizer;
508             $norm = DBSchema::Normalizer->new (
509             {
510             DSN => "DBI:mysql:music;host=localhost;"
511             . "mysql_read_default_file=$ENV{HOME}/.my.cnf",
512             src_table => "MP3",
513             index_field => "album_id",
514             lookup_fields => "artist,album,genre",
515             lookup_table => "tmp_albums",
516             dest_table => "songs",
517             copy_indexes => 1,
518             verbose => 1,
519             simulate => 1,
520             });
521             $norm->do();
522              
523             It is definitely not as safe as the normal call. However, TMTOWTDI, and it's your call. I am using it, but I don't recommend it. Read my lips: I DON'T RECOMMEND IT.
524              
525             =cut
526              
527             sub snew { # shortcut new (parameters called by position)
528             my ($class, $host, $db, $src_table, $index_field,
529             $lookup_fields, $lookup_table, $dest_table,
530             $copy_indexes, $verbose, $simulate) = @_;
531             my $DSN= "DBI:mysql:$db;host=$host;"
532             . "mysql_read_default_file=$ENV{HOME}/.my.cnf";
533             return new ($class, {
534             DSN => $DSN,
535             src_table => $src_table,
536             index_field => $index_field,
537             lookup_fields => $lookup_fields,
538             lookup_table => $lookup_table,
539             dest_table => $dest_table,
540             copy_indexes => $copy_indexes,
541             verbose => $verbose,
542             simulate => $simulate,
543             });
544             }
545              
546             =for internal use
547             (The Destroyer will clean up DBI objects.)
548              
549             =cut
550              
551             # use Data::Dumper;
552             sub DESTROY {
553             my $self = shift;
554             # print STDERR Data::Dumper->Dump([$self],["InDestroyer"]);
555             $self->{_dbh}->disconnect();
556             undef $self->{_dbh};
557             }
558              
559             =item do
560              
561             do();
562              
563             do() performs the normalization according to the parameters already received. Internally, it calls I<create_lookup_table()> and I<create_normalized_table()>.
564              
565             It fails (returning 0) if not enough parameters have been supplied to new().
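 
             do() returns a false value when required parameters are missing, so
             its result is worth checking:
 
             $norm->do() or die "normalization failed: missing parameters\n";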
566              
567             =cut
568              
569             sub do {
570             my $self = shift;
571             return 0 unless $self->_init();
572             $self->create_lookup_table();
573             $self->create_normalized_table();
574             return 1;
575             }
576              
577             =for internal use
578             (Checks that given keys in internal blessed hash are defined)
579              
580             =cut
581              
582             sub _init_field {
583             my $self = shift;
584             my @fields = @_;
586             foreach (@fields) {
587             if (!defined $self->{$_}) {
588             $self->_verbose("0", "missing $_\n");
589             return 0;
590             }
591             }
592             return 1;
593             }
594              
595             =for internal use
596             (_verbose() will print a message, depending on the currently set verbose level)
597              
598             =cut
599              
600             sub _verbose {
601             my $self = shift;
602             my $level = shift;
603             my $msg = shift;
604             if ($self->{verbose} >= $level) {
605             $msg =~ s/\s+/ /g;
606             print STDOUT "$msg\n";
607             }
608             }
609              
610             =for internal use
611             (_get_indexes() will find the indexes from src_table and set the internal values _primary_key_fields and _normal_fields_indexes with DML instructions to re-create the indexes within a SQL statement - It will identify multiple and unique keys)
612              
613             =cut
614              
615             sub _get_indexes {
616             my $self = shift;
617             # gets indexes description from the DB engine
618             my $DML = "SHOW INDEX FROM $self->{src_table}";
619             $self->_verbose("2","#$DML;");
620             my $sth = $self->{_dbh}->prepare($DML);
621             $sth->execute();
622             my %indexes = (); # list of indexes with associated columns
623             my @unique = (); # list of unique indexes
624             my $new_index_added =0;
625             my @lu_fields = split /,/, $self->{lookup_fields};
626             my %lu_indexes = map { $_ , 0 } @lu_fields;
627             while (my $hash_ref = $sth->fetchrow_hashref()) {
628             $self->_verbose("3",
629             "# $hash_ref->{Key_name}\t$hash_ref->{Column_name} ");
630             # check that lookup fields have an associated index
631             if (exists $lu_indexes{$hash_ref->{Column_name}}) {
632             $lu_indexes{$hash_ref->{Column_name}} = 1;
633             }
634             # first, we collect primary key columns
635             if ($hash_ref->{Key_name} eq "PRIMARY") {
636             # check if primary key column is among the lookup fields
637             # and if so, replace any lookup field reference with
638             # the new index (foreign key)
639             if (grep {$hash_ref->{Column_name} eq $_} @lu_fields) {
640             if (!$new_index_added) {
641             $self->{_primary_key_fields} .= ","
642             if $self->{_primary_key_fields} ;
643             $self->{_primary_key_fields} .=$self->{index_field};
644             $new_index_added =1;
645             }
646             }
647             else {
648             $self->{_primary_key_fields} .= ","
649             if $self->{_primary_key_fields} ;
650             $self->{_primary_key_fields} .=$hash_ref->{Column_name};
651             }
652             }
653             else {
654             # collects normal columns indexes, skipping lookup fields
655             next if (grep {$hash_ref->{Column_name} eq $_} @lu_fields);
656             $indexes{$hash_ref->{Key_name}} .= ","
657             if $indexes{$hash_ref->{Key_name}};
658             $indexes{$hash_ref->{Key_name}} .=$hash_ref->{Column_name};
659             push @unique, $hash_ref->{Key_name}
660             if $hash_ref->{Non_unique} eq "0";
661             }
662             }
663             $self->{_primary_key_fields} =
664             ", PRIMARY KEY (" . $self->{_primary_key_fields} . ")"
665             if $self->{_primary_key_fields};
666             foreach my $key (keys %indexes) {
667             # create the indexes description for SQL
668             $self->{_normal_fields_indexes} .= ", ";
669             $self->{_normal_fields_indexes} .= " UNIQUE "
670             if grep { $key eq $_ } @unique;
671             $self->{_normal_fields_indexes} .=
672             " KEY " . $key ." ($indexes{$key})";
673             }
674             # check for primary key and keys associated with lookup fields.
675             croak "missing primary key in $self->{src_table}\n"
676             . " A primary key is needed for this operation\n"
677             unless ($self->{_primary_key_fields});
678             if (grep {$_ == 0} values %lu_indexes) {
679             print STDERR "*" x 70, "\n";
680             print STDERR
681             "WARNING: the following columns, identified as lookup fields,\n"
682             . "ARE NOT INDEXED. This will have a huge impact on "
683             . "performance.\n"
684             . "Therefore it is advisable to set these indexes before "
685             . "continuing.\n";
686             print STDERR "missing indexes: ",
687             map {"<$_> "} grep {$lu_indexes{$_} == 0} keys %lu_indexes;
688             print STDERR "\n";
689             print STDERR "*" x 70, "\n";
690             if ($self->{ignore_warning}) {
691             print STDERR
692             "you chose to ignore this warning and the operation is "
693             . "carried out anyway, as you wish\n";
694             }
695             else {
696             croak
697             "missing indexes in lookup fields - operation aborted\n";
698             }
699             }
700             }
701              
702             =for internal use
703             (_get_field_descriptions() will extract the data definition from src_table and prepare appropriate statements to re-create the needed fields in dest_table and lookup_table)
704              
705             =cut
706              
707             sub _get_field_descriptions {
708             my $self = shift;
709             # gets table description from DB engine
710             my $DML = "SHOW FIELDS FROM $self->{src_table}"; # DESCRIBE $self->{src_table} would have the same effect
711             $self->_verbose("2","#$DML;");
712             my $sth = $self->{_dbh}->prepare($DML);
713             $sth->execute();
714             my @lu_fields = split /,/, $self->{lookup_fields};
715             # divide description between normal fields (which will go to the
716             # destination table) and lookup fields (for the lookup table)
717             while (my $hash_ref = $sth->fetchrow_hashref()) {
718             $self->_verbose("3", "#$hash_ref->{Field}\t$hash_ref->{Type}");
719             if (grep {$hash_ref->{Field} eq $_} @lu_fields) {
720             $self->{_lookup_fields_description} .= ","
721             if $self->{_lookup_fields_description} ;
722             $self->{_lookup_fields_description} .= $hash_ref->{Field}
723             . " " . $hash_ref->{Type};
724             $self->{_lookup_fields_description} .= " not null "
725             unless $hash_ref->{Null};
726             $self->{_lookup_fields_description} .=
727             " default " . $self->{_dbh}->quote($hash_ref->{Default})
728             if $hash_ref->{Default};
729             }
730             else {
731             $self->{_normal_fields_description} .= ","
732             if $self->{_normal_fields_description} ;
733             $self->{_normal_fields_description} .=
734             $hash_ref->{Field} . " " . $hash_ref->{Type};
735             $self->{_normal_fields_description} .= " not null "
736             unless $hash_ref->{Null};
737             $self->{_normal_fields_description} .= " default "
738             . $self->{_dbh}->quote($hash_ref->{Default})
739             if $hash_ref->{Default};
740             if (lc $hash_ref->{Extra} eq "auto_increment"
741             and $self->{copy_indexes})
742             {
743             $self->{_normal_fields_description} .=
744             " auto_increment ";
745             }
746             $self->{_non_lookup_fields} .= ","
747             if $self->{_non_lookup_fields} ;
748             $self->{_non_lookup_fields} .= $hash_ref->{Field};
749             }
750             }
751             }
752              
753             =for internal use
754             (_init() will clean the description fields and fill them with appropriate calls to _get_field_descriptions() and _get_indexes())
755             Uncommenting the lines mentioning Data::Dumper will produce useful debug information.
756              
757             =cut
758              
759             #use Data::Dumper;
760             sub _init {
761             my $self = shift;
762             return 0 unless $self->_init_field(qw(src_table lookup_table
763             dest_table index_field lookup_fields));
764             $self->{lookup_fields} =~ tr/ //d;
765             my @lookup_fields = split /,/, $self->{lookup_fields};
766             croak "invalid index field"
767             if grep {$self->{index_field} eq $_} @lookup_fields;
768             # "src" and "lkp" are the aliases for the source and lookup
769             # tables used in the final query.
770             # Therefore they can't be accepted as normal table names
771             croak "\"src\" and \"lkp\" are reserved words for this module. "
772             . "Please choose a different name\n"
773             if grep { /^(?:src|lkp)$/ } ($self->{src_table},
774             $self->{dest_table}, $self->{lookup_table});
775             #print STDERR Data::Dumper->Dump([$self], [ ref $self ]),"\n"; <>;
776             $self->{$_} ="" foreach (qw(_normal_fields_indexes
777             _lookup_fields_description _non_lookup_fields
778             _normal_fields_description _primary_key_fields));
779             $self->_get_field_descriptions();
780             $self->_get_indexes
781             if lc $self->{copy_indexes} =~ /^(?:1|2|yes|later)$/;
782             #print STDERR Data::Dumper->Dump([$self],["AfterInit"]),"\n"; <>;
783             return 1;
784             }
785              
786             =for internal use
787             (gets additional parameters into internal hash)
788              
789             =cut
790              
791             sub _get_params {
792             my $self = shift;
793             # if parameters are provided, they are merged with the internal hash
794             # and _init() is called
795             if (scalar @_) {
796             my $params = shift;
797             foreach my $key (keys %$params) {
798             croak "invalid parameter $key \n"
799             unless exists $_accepted_params{$key};
800             $self->{$key} = $params->{$key};
801             }
802             if ($self->{simulate} eq "1") {
803             $self->{verbose} = "2";
804             }
805             elsif ($self->{simulate} ne "0") {
806             croak "invalid value for <simulate>\n";
807             }
808             $self->_init();
809             }
810             }
811              
812             =item create_normalized_table()
813              
814             create_normalized_table() will create a 2nd normal form table, getting data from a source table and a lookup table.
815             Lookup fields in the source table will be replaced by the corresponding index field in the lookup table.
816              
817             If called without parameters, assumes the parameters passed to the object constructor.
818              
819             Parameters are passed as a hash reference, and are the same given to new() except I<DSN>, I<username> and I<password>. None are compulsory here. The missing ones are taken from the constructor. However, a check is done to ensure that all parameters are available from one call or the other.
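 
             For instance (table and field names are illustrative), the two-step
             form lets you override parameters at call time:
 
             $norm->create_lookup_table();
             $norm->create_normalized_table({
                 src_table     => "MP3",
                 index_field   => "album_id",
                 lookup_fields => "artist,album,genre",
                 lookup_table  => "tmp_albums",
                 dest_table    => "songs",
                 copy_indexes  => "yes",
             });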
820              
821             =cut
822              
823             sub create_normalized_table {
824             my $self = shift;
825             $self->_get_params(@_);
826             my $join_clause = "";
827             my $good_fields = ""; # fields that will be moved to the
828             # destination table
829             # ensure that the fields are called with an appropriate table alias
830             foreach (split /,/, $self->{_non_lookup_fields}) {
831             $good_fields .= ", " if $good_fields;
832             $good_fields .= "src." . $_;
833             }
834             # create the JOIN clause, using the lookup fields as foreign keys
835             foreach (split /,/, $self->{lookup_fields}) {
836             $join_clause .= " and " if $join_clause;
837             $join_clause .= "src.$_ =lkp.$_";
838             }
839             # removes any existing table with the same name as dest_table.
840             my $DML = "DROP TABLE IF EXISTS $self->{dest_table}";
841             $self->_verbose("2", "$DML;");
842             $self->{_dbh}->do($DML) unless ($self->{simulate});
843             # creates the destination table.
844             $DML =qq[CREATE TABLE $self->{dest_table}
845             ($self->{_normal_fields_description},
846             $self->{index_field} INT(11) NOT NULL];
847             if (defined $self->{copy_indexes}
848             and (lc $self->{copy_indexes} =~ /^(?:1|yes)$/))
849             {
850             $DML .= $self->{_primary_key_fields} ;
851             $DML .= $self->{_normal_fields_indexes};
852             }
853             $DML .= qq[, KEY $self->{index_field} ($self->{index_field}))];
854             if (defined $self->{copy_indexes}
855             and (lc $self->{copy_indexes} =~ /^(?:2|later)$/))
856             {
857             $DML =~ s/ auto_increment / /;
858             print "# auto_increment for $self->{dest_table} needs to "
859             . "be set manually\n";
860             }
861             $self->_verbose("2", "$DML;");
862             $self->{_dbh}->do($DML) unless ($self->{simulate});
863             # inserts values into the destination table, from the source table
864             # JOINed with the lookup table
865             $DML = qq[INSERT INTO $self->{dest_table}
866             SELECT $good_fields, $self->{index_field}
867             FROM $self->{src_table} src
868             INNER JOIN $self->{lookup_table} lkp ON ($join_clause)];
869             $self->_verbose("2", "$DML;");
870             $self->{_dbh}->do($DML) unless ($self->{simulate});
871             # if copy_indexes was "2" or "later", then an ALTER TABLE statement
872             # is issued.
873             if (defined $self->{copy_indexes}
874             and (lc $self->{copy_indexes} =~ /^(?:2|later)$/))
875             {
876             $DML = qq[ALTER TABLE $self->{dest_table} ];
877             if ($self->{_primary_key_fields}) {
878             $self->{_primary_key_fields} =~ s/^\s?,/ADD/;
879             $DML .= $self->{_primary_key_fields};
880             }
881             if ($self->{_normal_fields_indexes}) {
882             $self->{_normal_fields_indexes} =~ s/^\s?,//
883             unless $self->{_primary_key_fields};
884             $self->{_normal_fields_indexes} =~ s/,/, ADD /g;
885             $DML .= $self->{_normal_fields_indexes};
886             }
887             $self->_verbose("2", "$DML;");
888             $self->{_dbh}->do($DML) unless ($self->{simulate});
889             }
890             $self->_verbose("1", "# $self->{dest_table} created and filled");
891             }
892              
893             =item create_lookup_table
894              
895             create_lookup_table() will create a lookup table, extracting repeating fields from a 1st normal form table. A numeric primary key is created.
896              
897             When called without parameters, assumes the values passed to the object constructor.
898              
899             Parameters are passed as a hash reference, and should include the following
900              
901             src_table table where to take the values from
902             lookup_fields which fields to take
903             lookup_table table to create
904             index_field primary key (will be foreign key in src_table)
905             to be created
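 
             As a sketch (table and field names are illustrative, and the column
             types are elided), a call such as
 
             $norm->create_lookup_table({
                 src_table     => "MP3",
                 lookup_fields => "artist,album,genre",
                 lookup_table  => "tmp_albums",
                 index_field   => "album_id",
             });
 
             issues statements equivalent to:
 
             DROP TABLE IF EXISTS tmp_albums;
             CREATE TABLE tmp_albums
                 (album_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
                  artist ..., album ..., genre ...,
                  KEY artist (artist), KEY album (album), KEY genre (genre));
             INSERT INTO tmp_albums
                 SELECT DISTINCT NULL, artist, album, genre
                 FROM MP3;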
906              
907             =cut
908              
909             sub create_lookup_table {
910             my $self = shift;
911             $self->_get_params(@_);
912             my $table_keys ="";
913             foreach (split /,/, $self->{lookup_fields}) {
914             $table_keys .= ", KEY $_ ($_)";
915             }
916             # removes any existing table with the same name as Lookup_table
917             my $DML = qq[DROP TABLE IF EXISTS $self->{lookup_table}];
918             $self->_verbose("2", "$DML;");
919             $self->{_dbh}->do($DML) unless ($self->{simulate});
920             # create the new table
921             $DML = qq[CREATE TABLE $self->{lookup_table}
922             ($self->{index_field} INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
923             $self->{_lookup_fields_description} $table_keys) ];
924             $self->_verbose("2", "$DML;");
925             $self->{_dbh}->do($DML) unless ($self->{simulate});
926             # gets fields from the source table
927             $DML = qq[INSERT INTO $self->{lookup_table}
928             SELECT DISTINCT NULL, $self->{lookup_fields}
929             FROM $self->{src_table}];
930             $self->_verbose("2", "$DML;");
931             $self->{_dbh}->do($DML) unless ($self->{simulate});
932             $self->_verbose("1", "# $self->{lookup_table} created and filled");
933             }
934              
935             =back
936              
937             =head1 AUTHOR
938              
939             Giuseppe Maxia, giuseppe@gmaxia.it
940              
941             =head1 COPYRIGHT
942              
943             The DBSchema::Normalizer module is Copyright (c) 2001 Giuseppe Maxia,
944             Sardinia, Italy. All rights reserved.
945            
946             You may distribute this software under the terms of either the GNU
947             General Public License version 2 or the Artistic License, as
948             specified in the Perl README file.
949            
950             The embedded and enclosed documentation is released under
951             the GNU FDL Free Documentation License 1.1
952              
953             =cut
954              
955             1;