Choosing a file format for digital preservation
32 million digitised newspaper pages are a lot of digital data to preserve for posterity.
We knew from the start that preserving the digitised newspapers in the best possible way meant choosing the right format for the digital files. Making that choice turned out to be a time-consuming and extensive process.
In the initial phases of the project we hoped that looking into the choices and experiences of other projects and libraries would offer us some guidelines – maybe even a marker – as to what format to choose. However, it turned out that things were not that simple. The ultimate format registry or guidelines which could help us choose the right format were nowhere to be found.
We soon discovered that we had to take quite a few parameters into consideration when choosing a file format. Just like other libraries and archives before us, we had to think about size, cost, sustainability, rights and so on before we could make a decision.
The experiences of others did, however, narrow our choice down to two formats: JPEG2000 or TIFF. The constraints of the project, primarily cost and scale, were decisive for our choice.
Based on a calculation of the expected total size of the 32 million digital newspaper pages, an estimate for each file format was generated. The result was that 32 million pages in TIFF with lossless compression would add up to approximately 1,300 terabytes and 32 million pages in lossless JPEG2000 would add up to approximately 800 terabytes.
These estimates demonstrate the large scale of the project. They also show that cost is a very important aspect of choosing a file format for long-term digital preservation and dissemination of the files.
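To put the two estimates side by side, the short sketch below derives the implied average size per page and the storage saved by choosing lossless JPEG2000 over lossless TIFF. It uses only the figures quoted above; reading a terabyte as 10^12 bytes is an assumption, and the function and variable names are purely illustrative.

```python
# Figures taken from the estimates quoted in the text.
PAGES = 32_000_000
ESTIMATES_TB = {"TIFF (lossless)": 1300, "JPEG2000 (lossless)": 800}

# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TB = 10**12
MB = 10**6


def avg_page_mb(total_tb, pages=PAGES):
    """Average size of one page in megabytes for a given collection total."""
    return total_tb * TB / pages / MB


for name, total_tb in ESTIMATES_TB.items():
    print(f"{name}: {total_tb} TB in total, about {avg_page_mb(total_tb):.1f} MB per page")

# Storage saved by JPEG2000 relative to TIFF, in TB and as a share of the TIFF total.
saved_tb = ESTIMATES_TB["TIFF (lossless)"] - ESTIMATES_TB["JPEG2000 (lossless)"]
saved_share = saved_tb / ESTIMATES_TB["TIFF (lossless)"]
print(f"Saved with JPEG2000: {saved_tb} TB ({saved_share:.0%} of the TIFF estimate)")
```

Under these assumptions the difference is 500 terabytes, a bit more than a third of the TIFF estimate, and that is before any additional copies kept for preservation are counted.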
Analysis of formats
Aside from the estimated size and the resulting economic considerations, we compared the two file formats on a number of aspects to ensure that the chosen format would meet all our requirements for digital preservation. Statsbiblioteket's requirements for digital preservation are described in the library's Digital Preservation Strategy, where we have specified a range of format properties that influence our choice of preservation format.
We examined the following points in detail:
- The openness of the format – is it in any way restricted?
- The spread of the format – how widely used is the format, who is using it and for what purposes? Can we expect the format to still exist and be in use in the longer term?
- The acceptance of the format as a long-term preservation format in the international digital preservation community – how easy and secure is the format judged to be with regard to, for example, preservation, curation and migration?
- The error tolerance of the format – how many bits and bytes can go missing before the data becomes unreadable? How vulnerable is the format?
- The self-containedness of the format – does the format depend on other formats in order to be rendered?
- The economy – how expensive is the format to preserve, curate and migrate?
- The dissemination potential of the preservation format – is the format suited for dissemination purposes? Will we need a separate dissemination format, or can the preservation format be used for dissemination too (and thereby save us from preserving a lot of extra, and expensive, terabytes)?
Both formats turned out to be suitable for our use, but based on the analysis, including the economic aspect, we have chosen JPEG2000 as our preservation format. More specifically, we have chosen part 1 of the JPEG2000 family, namely JP2. Based on the format analysis we have defined a profile specification of JP2 that supports our preservation needs.
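As a small illustration of what identifies a file as JP2 (JPEG2000 part 1), the sketch below checks the first bytes of a file for the JP2 signature box and a File Type box with the brand `jp2 `. It is only a minimal first check and not the profile specification mentioned above: it says nothing about compression settings or other profile parameters. The function name `looks_like_jp2` is hypothetical.

```python
import struct

# JP2 signature box from JPEG 2000 part 1 (ISO/IEC 15444-1):
# box length 12, box type 'jP  ', content 0x0D0A870A.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def looks_like_jp2(data: bytes) -> bool:
    """Return True if data starts like a JP2 file.

    Checks only the signature box and the start of the File Type box
    that follows it; it does not validate the rest of the file.
    """
    if not data.startswith(JP2_SIGNATURE):
        return False
    if len(data) < 24:
        return False
    # File Type box header: 4-byte length, 4-byte type, then a 4-byte brand.
    _length, box_type, brand = struct.unpack(">I4s4s", data[12:24])
    return box_type == b"ftyp" and brand == b"jp2 "


# Usage with a hand-built header (not a complete image file).
example_header = JP2_SIGNATURE + struct.pack(">I4s4sI4s", 20, b"ftyp", b"jp2 ", 0, b"jp2 ")
print(looks_like_jp2(example_header))   # a JP2-style header
print(looks_like_jp2(b"II*\x00" + b"\x00" * 20))  # a little-endian TIFF-style header
```

In practice the first few dozen bytes read from a file on disk would be passed in; a full check against a preservation profile needs a dedicated validation tool.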
Our analysis of the JPEG2000 format is based on research into the following resources:
- http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=37674
- http://books.google.dk/books?id=4eOnNwXah7EC&printsec=frontcover&hl=da&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false
- http://digidaily.kb.se/wp-content/uploads/2011/11/JPEG2000-utredningsrapport.pdf
- http://digidaily.kb.se/wp-content/uploads/2011/11/kravspecifikation.pdf
- http://dltj.org/article/lossless-jpeg2000/
- http://echoone.com/filejuicer/formats/jp2
- http://en.wikipedia.org/wiki/Comparison_of_web_browsers
- http://en.wikipedia.org/wiki/JPEG_2000
- http://fclaweb.fcla.edu/uploads/Lydia%20Motyka/FDA_documentation/Action_Plans/jp2_bg.pdf
- http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5712946&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5712946
- http://library.wellcome.ac.uk/assets/wtx056572.pdf
- http://old.hki.uni-koeln.de/people/herrmann/forschung/heydegger_archiving2008_40.pdf
- http://tech.groups.yahoo.com/group/kakadu_jpeg2000/message/479
- http://wiki.opf-labs.org/display/JP2/Example+JP2+profiles
- http://wiki.opf-labs.org/display/SP/commentsRossSpencer
- http://www.collectionscanada.gc.ca/digital-initiatives/012018-2210-e.html
- http://www.digitalpreservation.gov/formats/fdd/browse_list.shtml
- http://www.dlib.org/dlib/july08/buonora/07buonora.html
- http://www.dlib.org/dlib/may11/vanderknijff/05vanderknijff.html
- http://www.dpconline.org/advice/faqs/588-faq-tiff-or-jpeg2000
- http://www.dpconline.org/docs/reports/dpctw08-01.pdf
- http://www.gdal.org/frmt_jp2kak.html
- http://www.impact-project.eu/faqs/impact-strategic-faq-answers/
- http://www.kakadusoftware.com/documents/Usage_Examples.txt
- http://www.loc.gov/ndnp/guidelines/docs/NDNP_JP2HistNewsProfile.pdf
- http://www.nb.no/anbud/old/2011/mikrofilmdigitalisering/vedlegg/Vedlegg_1_Kravspesifikasjon_mikrofilmdigitalisering_(NO).docx
- http://www.nb.no/bokhylla/om/digitalisering-av-boeker-i-nasjonalbiblioteket
- Article “A heuristic measure for detecting influence of lossy JP2 compression on OCR in the absence of ground truth” (Schlarb & Neudecker)