Digital Maktaba

AI-Driven Digital Library for Non-Latin Scripts

Digital Maktaba is a web platform for acquiring, digitising, cataloguing, and providing access to documentary heritage in non-Latin scripts, with a focus on Arabic, Persian, and Azerbaijani.

  • The system combines Optical Character Recognition (OCR), Natural Language Processing (NLP), and purpose-built tools to help librarians manage large-scale collections that traditional tools cannot process.
  • Developed within the PNRR-funded ITSERR project (Italian Strengthening of the ESFRI RI RESILIENCE), Digital Maktaba follows the “AI in the loop, human in charge” paradigm: the system proposes automatic cataloguing suggestions, while domain experts retain full control over validation and correction.
  • The platform is open source under the GNU General Public License v3.0 and is deployed as a containerised stack (Docker Compose) with Keycloak-based authentication via D4Science.

Features

Digital Maktaba provides an AI-assisted cataloguing workflow that supports librarians at every step, from automated content extraction to final validation. Scalable OCR and multi-language support let it process large and diverse collections, and pairing automatic suggestions with human review keeps the catalogue consistent and reliable. The platform runs on a containerised infrastructure with centralised authentication, designed for managing complex digital libraries.

Guided workflow for title, author, and category extraction. Bounding boxes are editable on-screen, with an integrated Arabic keyboard and dictionary checks. The librarian validates every suggestion before it enters the catalogue.
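The review step described above can be pictured as a small state machine around each suggestion. The following is only a minimal sketch: the `Suggestion` class, its field names, and the bounding-box layout are hypothetical and are not taken from the Digital Maktaba code base.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Status(Enum):
    PENDING = "pending"      # proposed by the system, not yet reviewed
    ACCEPTED = "accepted"    # librarian confirmed the text as proposed
    CORRECTED = "corrected"  # librarian replaced the proposed text


@dataclass
class Suggestion:
    kind: str                        # "title", "author", or "category"
    proposed: str                    # text extracted automatically
    bbox: Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) in page pixels
    status: Status = Status.PENDING
    final: Optional[str] = None      # value that would enter the catalogue

    def review(self, corrected: Optional[str] = None) -> None:
        """Record the librarian's decision: accept as-is or supply a correction."""
        if corrected is None or corrected == self.proposed:
            self.status, self.final = Status.ACCEPTED, self.proposed
        else:
            self.status, self.final = Status.CORRECTED, corrected


def catalogue_ready(suggestions) -> bool:
    """True only when no suggestion is still waiting for human review."""
    return all(s.status is not Status.PENDING for s in suggestions)
```

Keeping `proposed` and `final` separate preserves what the system suggested alongside what the librarian decided, which is one way to keep the human in charge while still being able to audit the automatic output.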

Keyword-based matching against user-defined category sets proposes leaf-level catalogue categories from extracted text. Librarians review and confirm suggestions before they are committed.
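As an illustration of keyword-based category suggestion, the sketch below counts keyword hits per leaf category and ranks the categories by score. The category names, the keywords, and the function `suggest_categories` are assumptions made for this example; the actual matching logic in the platform may differ, for instance by stemming tokens first.

```python
import re
from collections import Counter

# Hypothetical user-defined category set: leaf category -> keywords.
CATEGORY_KEYWORDS = {
    "Fiqh": {"فقه", "أحكام"},
    "Tafsir": {"تفسير", "القرآن"},
    "Hadith": {"حديث", "سنن"},
}


def tokenize(text):
    """Split text into word tokens; \\w is Unicode-aware, so Arabic letters match.

    Diacritics are combining marks and would split tokens, so vocalised text
    would need normalising first (not done in this sketch).
    """
    return re.findall(r"\w+", text)


def suggest_categories(text, category_keywords, top_n=3):
    """Return up to top_n (category, hits) pairs, best match first."""
    counts = Counter(tokenize(text))
    scores = {}
    for category, keywords in category_keywords.items():
        hits = sum(counts[keyword] for keyword in keywords)
        if hits:
            scores[category] = hits
    # Sort by descending hit count, then by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

The function only proposes candidates; in line with the workflow above, a librarian would still confirm or reject each one before it is committed.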

Automated processing of PDF and image collections (PDF, PNG, JPG/JPEG), using EasyOCR (built on PyTorch) for text recognition and PyMuPDF and pdf2image for I/O.
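A minimal sketch of how such a pipeline might dispatch on file type and hand pages to EasyOCR is shown below. The function names, the default language list, and the DPI value are assumptions for illustration, not the project's actual code; PDFs are rasterised here with pdf2image only, although PyMuPDF could serve the same purpose.

```python
from pathlib import Path

PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def input_kind(path):
    """Classify an upload as 'pdf' or 'image' from its extension."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"unsupported file type: {suffix or path}")


def ocr_document(path, languages=("ar", "fa"), dpi=200):
    """Return, for each page, EasyOCR's list of (bbox, text, confidence)."""
    # Third-party imports stay local so input_kind() works without them.
    import easyocr
    import numpy as np
    from pdf2image import convert_from_path

    # EasyOCR groups languages by script; a Latin-script language such as
    # "az" may need a separate Reader (assumption, check the EasyOCR docs).
    reader = easyocr.Reader(list(languages))
    if input_kind(path) == "pdf":
        # Rasterise each PDF page to a PIL image, then to a NumPy array.
        pages = [np.asarray(img) for img in convert_from_path(str(path), dpi=dpi)]
    else:
        pages = [str(path)]  # readtext() also accepts a file path
    return [reader.readtext(page) for page in pages]
```

Creating the `Reader` loads model weights, so a long-running service would probably build it once and reuse it across documents rather than per call as in this sketch.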

Supports Arabic, Persian, and Azerbaijani Turkish scripts. Automatic language detection (langid) and Arabic-specific NLP via Tashaphyne and arramooz-pysqlite for morphological analysis, stemming, and dictionary checks.
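Language detection with langid can be restricted to the languages the platform supports, which tends to help on short OCR fragments. The sketch below assumes a hypothetical wrapper `detect_language`; the injectable `classify` argument is a design choice for this example that makes the wrapper testable without the library installed.

```python
SUPPORTED = ("ar", "fa", "az")  # ISO 639-1 codes for the supported languages


def detect_language(text, classify=None, supported=SUPPORTED):
    """Return a supported language code, or None if detection is inconclusive."""
    if not text.strip():
        return None
    if classify is None:
        import langid  # third-party; imported lazily

        # Limit langid's candidate set to the supported languages.
        langid.set_languages(list(supported))
        classify = langid.classify
    lang, _score = classify(text)  # langid returns a (language, score) pair
    return lang if lang in supported else None
```

A result of `None` would be a natural point to ask the librarian to pick the language manually rather than guess.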

Two user roles: Librarian (ingest, catalogue, validate, edit) and Researcher (search, browse, consult). Authentication via Keycloak (OIDC) integrated with the D4Science infrastructure.
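The two roles can be sketched as a simple permission table. The role strings, the action names, and the helper functions below are assumptions for illustration; the mapping copies the list above literally, so whether librarians also hold the researcher actions is left open. The `realm_access.roles` claim is where Keycloak places realm roles in its tokens by default, but the role names configured for this deployment are not known.

```python
from enum import Enum


class Role(Enum):
    LIBRARIAN = "librarian"    # hypothetical role name in Keycloak
    RESEARCHER = "researcher"  # hypothetical role name in Keycloak


PERMISSIONS = {
    Role.LIBRARIAN: {"ingest", "catalogue", "validate", "edit"},
    Role.RESEARCHER: {"search", "browse", "consult"},
}


def roles_from_claims(claims):
    """Extract known roles from decoded (already verified) token claims."""
    names = claims.get("realm_access", {}).get("roles", [])
    known = {role.value: role for role in Role}
    return {known[name] for name in names if name in known}


def is_allowed(roles, action):
    """True if any of the user's roles grants the requested action."""
    return any(action in PERMISSIONS[role] for role in roles)
```

The sketch assumes the token signature and expiry were checked beforehand by the OIDC client; it only interprets the claims.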

Full-text and metadata search across the entire library. Document detail view with page-level navigation. Filterable library page for browsing the collection.

Built on PostgreSQL for persistence, with NiceGUI as the web framework. Containerised deployment via Docker Compose. Uploads stored on mounted volumes with structured subfolders. Background processing via APScheduler for asynchronous OCR and ingestion tasks.