Summary

This component enhances your content with automatically generated summaries of PDF files. It is combined with the download component, so the rendered output is a summary of the PDF file plus a button to download the original file. The summary of a given PDF file is saved in the same folder as the source PDF. Where possible, the component also generates a simple ToC of the summary to make long summaries easier to navigate.

We cannot guarantee the quality of the summary generated by this component, as it depends on various factors (such as the model used, the language, the spelling and grammar of the source, and the structure of the input). For this reason, we advise reviewing the summary and correcting it as needed before deploying the result to your production site. After applying manual corrections, you must build and deploy the site again. The summary will not be generated again; the previously corrected text will be used to render the content.

We use models downloaded locally from Huggingface. Your data will not be sent to Huggingface to be used for training the models. This also means that models are not updated to their latest version automatically; when needed, update them manually by removing the specific cached model from ~/.cache/huggingface/hub/* on macOS/Linux or from C:\Users\<YourUsername>\.cache\huggingface\hub on Windows. We recommend checking Huggingface regularly for updates to the models used.
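
As a rough sketch of that manual step, the Python snippet below removes one cached model so that it is downloaded again the next time it is needed. It assumes the default cache location and the `models--<org>--<name>` folder naming used by the Huggingface hub cache. The helper names `cached_model_dir` and `remove_cached_model` are illustrative and not part of this component. If the cache was relocated (for example through the `HF_HOME` environment variable), pass the actual folder explicitly.

```python
import shutil
from pathlib import Path

# Default hub cache location; an assumption that holds only if the cache was not relocated.
DEFAULT_HUB_DIR = Path.home() / ".cache" / "huggingface" / "hub"


def cached_model_dir(model_id: str, hub_dir: Path = DEFAULT_HUB_DIR) -> Path:
    """Return the cache folder for a model id such as 'facebook/bart-large-cnn'."""
    return Path(hub_dir) / ("models--" + model_id.replace("/", "--"))


def remove_cached_model(model_id: str, hub_dir: Path = DEFAULT_HUB_DIR) -> bool:
    """Delete the cached copy of a model; return True if something was removed."""
    target = cached_model_dir(model_id, hub_dir)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

Calling `remove_cached_model("facebook/bart-large-cnn")` would clear the model used in the examples on this page; a return value of `False` simply means nothing was cached under that name.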

Usage

Generating PDF summaries can be time- and resource-consuming. For this reason, we strongly recommend building your site incrementally when using this component in many documents. Remember that once a summary has been generated and exists in the site, it will not be generated again at subsequent builds, even if manual corrections are applied to it between builds. So, if you plan to use this component in several documents, do it gradually, document by document. Do not add it to all the documents at once: add it to one document, build the site, and then move on to the next document.
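
To see how much work the next build would trigger, a small script along these lines can list the PDFs that do not have a summary file yet. It is only a sketch: it relies on the `<pdf_file_name>__pdf_summary.txt` naming convention used by this component, while the function name `pdfs_without_summary` and the `doc-contents` path in the usage note are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def pdfs_without_summary(root: Path) -> List[Path]:
    """Return PDFs under root that have no '<name>__pdf_summary.txt' next to them."""
    pending = []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        # The summary is expected in the same folder as the PDF it belongs to.
        summary = pdf.with_name(pdf.stem + "__pdf_summary.txt")
        if not summary.is_file():
            pending.append(pdf)
    return pending
```

Running `pdfs_without_summary(Path("doc-contents"))` before a build returns the documents whose summaries would still have to be generated. If the list is long, add the component to fewer documents at a time.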

Do not attempt to build your site directly on GitHub Pages when using this component. By default, the deployment action does not allow this, but it can easily be modified to do so. The build time can be very long and will consume your build minutes. Always build locally and deploy afterwards.

Example

The next examples are based on the following files:

📁 pdf-summary/
├── 📄 pdf-summary.md
├── 📄 pe.pdf
├── 📄 pe__pdf_firstpage.png
├── 📄 pe__pdf_summary.txt
├── 📄 pr.pdf
├── 📄 pr__pdf_firstpage.png
├── 📄 pr__pdf_summary.txt
├── 📄 pt.pdf
├── 📄 pt__pdf_firstpage.png
└── 📄 pt__pdf_summary.txt

The model used for this summarisation is:

pdf_sum_model: "facebook/bart-large-cnn"

The first example shows a generated summary without any manual correction, to illustrate the limitations of the model used (which is one of the best-rated open-source and multilanguage summarisation models for medium and large texts). However, the model can easily be replaced in the _data/buildConfig.yml configuration file. For a multilanguage documentation site, it is even possible to set a different model for each language. Be aware that not all open-source models work in this summarisation context. If an incompatible model is set, the error message raised at build time lists compatible models to choose from on Huggingface.

Note that the PDF summaries are saved in the same folder as the original PDF file and are named <pdf_file_name>__pdf_summary.txt. When building the site, as long as a file following this naming convention is found in the folder, the summary for <pdf_file_name>.pdf will not be generated again. Feel free to apply any manual correction to the summary file to remove irrelevant paragraphs and/or model hallucinations. Be aware that the site must be built again after each manual correction. To force the summary to be re-generated, delete <pdf_file_name>__pdf_summary.txt and build the site again.

Keep in mind that re-generating the summary discards any previously applied manual corrections.
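
Because manual corrections are lost in this way, it can be worth keeping a copy of the corrected text before forcing a re-generation. The sketch below moves the summary aside instead of deleting it. The function name `force_regeneration` and the `.bak` suffix are illustrative choices, not something the component provides, and the approach assumes the component only looks for the exact `__pdf_summary.txt` file name.

```python
from pathlib import Path
from typing import Optional


def force_regeneration(pdf_path: Path) -> Optional[Path]:
    """Rename '<name>__pdf_summary.txt' to a .bak file so the next build regenerates it.

    Returns the backup path, or None when no summary file exists.
    """
    pdf_path = Path(pdf_path)
    summary = pdf_path.with_name(pdf_path.stem + "__pdf_summary.txt")
    if not summary.is_file():
        return None
    backup = summary.with_name(summary.name + ".bak")
    # replace() overwrites an older backup, so only the latest corrected text is kept.
    summary.replace(backup)
    return backup
```

After the next build, the backup can be compared with the newly generated summary to carry over any corrections that are still relevant.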

PDF 1

The next example demonstrates a generated summary without any subsequent manual correction. It also shows the behaviour of the component when a fixed height is not specified for the summary text and the summary text is pretty long.

{% include elements/pdf-summary.html 
    file="_experiments/pdf-summary/pr.pdf"
    btnType="danger"
    btnOutline="false"
    btnText="Download"
    sBorder="true"
%}

START PDF SUMMARY

List of figures

A dedicated API workshop was organised in the first week of December 2024. All APIs were tested and there were no errors found. Comments about potential improvements related to the API documentation were well received.

In October 2024 we produced the first version of the IMPOR.OR.D10 - Full plan for achieving Operational Readiness. The final SAT was scheduled in the first half of November 2024 (and was carried out in Birnin Zana on 6th and 7th of Nov 2024) The stakeholders were informed that an annex to the initial assignment was already signed by CAO and the Contactor. The annex extends the contract duration with 18 months.

The topic of eID API (to be used by third parties for integrating eID features in external systems) was raised by some stakeholders. Based on the expert opinion from a parallel AVENGERS-AID project, it was strongly demanded to make the API public as soon as possible. It is still not clear if the needed material and human resources can be allocated and used starting with 2025.

The final Site Acceptance Testing session was carried out in Birnin Zana, on 6th and 7th of November 2024. The conclusion is that the eID system provides all needed API for integrating eID features into third party systems. Any other use case is possible but must be subject to other projects (coming both from public administration area or from private sector) It was clear that, considering the features made available through APIs, a well-defined formal and technical API control and monitoring framework must be put in place.

It was useful to brainstorm the most probable integration use cases (strong authentication for eWakanda or strong authentication and moving the whole public notary document signing flow into eNotary by implementing digital signature API) There is no single body to coordinate all these stakeholders and to properly maintain the eID system. The needed resources (human and material) necessary for providing the support services to the end-users (citizens and RAO) are not identified (sized) and provisioned. The necessary procedures and tools are not in place.

Contractor

eID system will be close to the moment when will be released for operations. Human resources involved must understand that eID will be part of their day-by-day duties. The risk probability is still M (medium) since there is still one quarter available for appointing the eID operator.

In any case, the risk impact on achieving the Operational Readiness is H (high), even blocking for some activities. IMP.5 Work Breakdown Structureo IMP.BDIS = Build Digital Identity Service Implementation Chapter (design, development, and implementation) For more information, or to get involved in a project of your own, go to: http://www.ibm.com/digitalidentity.

WBS Codes for Project Deliverables

The WBS codes used for deliverables are D(1-12), PR, ORR, MOM, DFR and FR. The code is derived from the deliverable name (PR = Progress; ORR = On-Request Report; DFR = Draft Final Report; FR = Final Report) If a complex deliverable must be split in several distinct parts (such as Deliverable 9 – D9), the code contains also a numeric identifier.

Specific Objectives

• Specific Objective 1: To ensure availability of relevant information. The inventory of existing information systems collaborate for the implementation. The system is installed and configured in the appropriate way.

Adequate and sufficient beneficiary resources have been provided. All potential non-conformities (from FATtesting) were corrected.

Acceptance Testing

• All potential non-conformities (from Functionalproduction environment) are considered. The planning of the training sessions is also considered. Adequate and sufficient beneficiary protections are in place.

A workplan has been drawn up for the project. The summary of the workplan is:Table 4: Workplan.

Provision Technical Support L2 and L3

The next reporting period is Q9 (January, February, March 2025) This is the period necessary for drafting the next Progress Report (MCPR-9) The expected new deliverables will be:At the end of the next reportingperiod, the expected work in progress will be.

END PDF SUMMARY

DOCX 1

PDF 2

The quality of the summary depends very much on how the original document is structured and written. The next example shows another generated summary, again without any manual correction, but for a document with a different structure that yields better-quality text.

{% include elements/pdf-summary.html 
    file="_experiments/pdf-summary/pt.pdf"
    btnType="danger"
    btnOutline="false"
    btnText="Download PDF"
    sBorder="true"
    sh="300px"
%}

START PDF SUMMARY

RonPub Journal Paper Template

This article is not a scientific paper, but a template file and guidelines for helping authors prepare their scientific papers. A number of nice paper templates have been developed, like ACM templates [1] and IEEE templates[7] [8]. However, these templates and guidelines do not fit the features and needs of our journals.

We encourage authors to have a look at an excellent handbook for science writing. Main text is a large as well as major part of a paper. Author's research work, e.g. the new techniques, is presented in this part. Application papers may include an implementation (sub-) section. Discussion and Conclusion should explore the significance of the research results.

The main text should be divided into clearly defined and numbered sections. A separate section is needed to describe the structure and content guidelines for the main text. The formatting and styling has been setup for multiple sections, subsections and subsubsections in this template document.

As a minimum for web references, the full URL should be given as well as the date when the reference was last accessed. Appendices are optional. They should be placed after the references and before the author biographies.

If there is more than one appendix, they should be identified as A, B, etc. An example of a conference paper can be found at ICEIS 2008 in Barcelona, Spain.

END PDF SUMMARY

DOCX 2

PDF 3

The next example is based on a generated summary followed by manual corrections, because the model could not correctly detect the document sections and was not always able to split the summary into logical sentences.

{% include elements/pdf-summary.html 
    file="_experiments/pdf-summary/pe.pdf"
    btnType="danger"
    btnOutline="false"
    btnText="Download PDF"
    sBorder="true"
    sh="300px"
%}

START PDF SUMMARY

Intro

Paper must be in one Columns after Authors Name. All Sub Heading must be in Title Case, Left 0.25 cm, Italic, and Alphabet Numbering (A, B, C…etc.) Paper Title must be in Font Size 20 with Single Line Spacing.

Submit your manuscript electronically for review. Use words rather than symbols. Put units in parentheses.

Do not label axes only with units. When you submit your final version, after your paper has been accepted, prepare it in two-column format. For example, to insert images in Word, position the cursor at C.

Units

Write “ temperature (K),” not “Temperature/K.” Write “Magnetization (kA/m)” or “ magnetization [103 A/m]” Figure labels should be legible, approximately 8 to 12 point type. Abbreviations such as SI, ac, and dc do not have to be defined. Use a zero before decimal points: “0.25,” not “.25.” Use one space between number and unit: 0.1 cm, not 0.1cm. When expressing a range of values, write “7 to 9” or ‘7-9’. Not mix complete spellings and abbreviations of units. Avoid contractions; for example, write “do not” instead of “don’t.” The serial comma is preferred: “A, B, and C” instead of ‘a, B and C’

Wording

The word “data” is plural, not singular. The term for residual magnetization is “remanence”. Do not use the word ‘essentially’ to mean “approximately” or “effectively” A graph within a graph is an “inset,” not an “insert”. Authors should expect to be challenged by reviewers if the results are not supported by adequate data and critical details. Be aware of the different meanings of the homophones “affect” (usually a verb) and “effect”. Do not confuse “imply” and “infer”. Prefixes such as “non,” “sub“‚ “micro“, “multi“, and “ultra“ are not independent words.

Author bio

The first paragraph should contain a place and/or date of birth. The second paragraph uses the pronoun of the person (he or she) and not the author’s last name. It lists military and work experience, including summer and fellowship jobs.

Current and previous research interests end the third paragraph. The third paragraph begins with the author’s title and last name (e.g., Dr.Smith, Prof. Jones, Mr. Kajor, Ms.Hunter).

List any memberships in professional societies other than the IAENG. If a photograph is provided, the biography will be indented around it.

The photograph is placed at the top of the biography. Personal hobbies will be deleted from the biography if they are listed.

END PDF SUMMARY

DOCX 3

There is no general rule for predicting whether a given PDF will produce a good-quality summary. The best way to use this component is to start with the auto-generated summary, review it carefully and apply manual corrections where needed.

Corrections

As described, corrections can be applied manually to the generated summaries to remove irrelevant paragraphs or model hallucinations. The summarisation algorithm is set to generate rather detailed summaries, so that you can choose the relevant parts to render in the document. However, PDF summaries are usually less structured than DOCX ones, because structure extraction from a PDF is looser than from a DOCX, where it can be tied to the document headings. On the other hand, PDF summaries may contain more relevant details.

When working with DOCX files, it is good practice to compare the summary generated for the original DOCX with the summary generated for the same document converted to PDF, and to choose the one that better suits your purpose.

Limitations

Tables, images and sections such as Table of Contents, List of Tables and List of Figures are usually excluded from summarisation. As far as possible, annotations, comments, references, bibliography, citations and similar sections are also excluded.

The models may not always detect these kinds of sections accurately, which is another reason why we strongly recommend checking the generated summaries and applying manual corrections as needed. Summarisation does up to about 90% of the job, but the remaining part can make the difference.

Parameters

file: path to the PDF file, given as a relative path from the root of the doc-contents folder
btnType: type of the download button, default value is primary. See Downloads.
btnOutline: outline style of the download button, default value is false. See Downloads.
btnText: text on the download button, default value is Download. See Downloads. Note that this label is not translated automatically. Since it is set as a parameter, it must be adapted manually to the site language.
sBorder: whether the PDF summary block has a thin left border, default value is false
sh: fixed height of the PDF summary block (for example 300px), default value is auto.

Tip

Since summaries can sometimes be quite long, we recommend setting a fixed height to improve the readability of the document and the reading experience.