Why is there no index for CSV files? Indexes are very simple.
If the first column of your CSV file is in sorted order you can do a binary search to find your data. But what if you need to find data in the second or third column?
If you have a separate index file pointing to the first byte of each line you could seek to that position in the CSV file and get your data. Given a file with only your needed column and the byte offset and length of the line you can search the index to find the pointer to the position in the CSV file.
SAMPLE DATA
“record”,”title”,”data”
1,”test1”,”data1”
2,”test3”,”data3”
3,”test2”,”data2”
SAMPLE INDEX
“test1”,24,17
“test2”,58,17
“test3”,41,17
“title”,0,23
Here is a simple perl program to create just such an index.
#!/usr/bin/perl # csvindex.pl use strict; use warnings; use Text::CSV_XS; my $file = $ARGV[0]; my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, sep_char => ',' # not really needed as this is the default }); open(my $data, '<:encoding or die not open my while>getline( $data )) { $tail = tell $data ; printf("\"%s\",%d,%d\n",$fields->[$ARGV[1]], $head, $tail-$head); $head = $tail; }
This command create an index of the second column (number start with 0).
$ perl csvindex.pl 2009.csv 1 | sort > 2009.cidx
A binary search can be done with the linux ‘look’ command.
$ look \"OK\" 2009.cidx "OK",2134,59
Here is a simple program to seek into a file and print ‘n’ number of characters.
#!/usr/bin/perl # csvseek.pl use strict; use warnings; use Text::CSV_XS; my $file = $ARGV[0]; open(my $data, '<:encoding or die not open seek my read chomp print .>Here is how it is used.
$ perl csvseek.pl 2009.csv 2134 59 OKLAHOMA,OK,40,"4,765","1,251","2,160","1,024",6,"9,206",I’d like to see these tools added to CSVKit.
What I’d really like to see is this index/s added to MySQL’s CSV engine. Would anyone like to help me get this done?
Tweet
Log Buffer #279, A Carnival of the Vanities for DBAs | The Pythian Blog wrote:
[…] is there no index for CSV files? Indexes are very simple. Mark Grennan has […]
Link | July 27th, 2012 at 2:01 am
Price wrote:
There is certainly a lot to find out about this issue.
I really like all of the points you have made.
Link | July 27th, 2012 at 2:04 pm