The client, a global publisher, manages a broad catalog of Scientific, Technical, and Medical (STM) peer-reviewed journals, medical research titles, and academic books. They publish primary research articles, case reports, annotated chapters, and evidence-based reviews across specialized disciplines. Many of their publications appear in PubMed and are archived in PubMed Central (PMC).
Running simultaneous XML production across dozens of journal titles and book series means a quality failure in one workflow stream can propagate across an entire imprint before it is caught. The client's requirements were shaped by that operational exposure: they needed a production architecture that maintained schema conformance across channels with different platform requirements, absorbed a high monthly page count without sacrificing either speed or accuracy, and handled a content portfolio so technically diverse that no single approach could cover it all. Since the client’s content was widely indexed across PubMed and PMC, accessibility, discoverability, and submission readiness were core production requirements.
The client needed JATS/NLM-based XML outputs prepared for PubMed and PubMed Central, with validation support across DTD, XSD, and RNG schema requirements for journals and books.
XML outputs had to follow the structural rules of each publishing system, aggregator, and distribution channel.
The workflow had to support 10,000 to 15,000 pages per month across concurrent journal and book production streams.
Each type of content, such as research articles, case reports, systematic reviews, chapters in books, and supporting front/back matter, required distinct structural treatments.
Accurate XML tagging was required for large reference sections, mathematical equations, multilingual abstracts, layered tables, and embedded figures.
Submission-ready XML bundles had to be optimized for interoperability across scholarly databases, institutional repositories, and abstracting and indexing services.
The scale and technical diversity of the content introduced compounding challenges at every stage of the XML production workflow. Source files arrived in inconsistent formats, the schema environment spanned multiple DTD versions, and the content included structurally complex elements demanding specialized handling. Meeting tight SLA commitments under these conditions required a production architecture built for both precision and volume.
Source File Inconsistency
Manuscripts arrived in five distinct source formats—Word, PDF, LaTeX, InDesign, and legacy XML—each with its own structural conventions and encoding behaviors. Without a consistent baseline, the normalization step had to be both comprehensive and adaptive, capable of identifying and resolving encoding errors, missing assets, and formatting anomalies before they entered the conversion pipeline.
DTD and Schema Management
Multiple JATS/NLM versions, publisher-specific schema extensions, and platform-level DTD variations had to be managed concurrently within a single production environment. A configuration error in any one schema environment could introduce systematic errors across an entire journal title, which made schema governance a critical discipline.
Complex Element Fidelity
Mathematical expressions, chemical formulas, nested table structures, and supplementary media files required specialized handling at every stage of processing. MathML encoding required distinguishing between inline and display equations. Tables with spanning headers and row footnotes required JATS-compliant modeling to preserve their relational meaning. Figure packages had to include captions, accessibility metadata (alt text), and persistent identifiers, all without loss of formatting.
Reference Structuring and Identifier Verification
Our team had to convert bibliographic references in mixed citation formats into fully tagged XML with element-level granularity, then cross-check the identifiers in each reference (DOIs, PMIDs, and PMCIDs) for completeness and accuracy. Downstream bibliographic linking in scholarly databases depended on these identifiers, so any gap in coverage could cause linking failures.
Conformance under SLA Pressure
The combination of high volume, strict SLAs, and rigorous PMC conformance requirements left no margin for running quality steps (XML validation, Schematron checking, and expert editorial review) one after another; instead, we ran them concurrently as integrated parts of the workflow.
Submission Packaging for Multiple Platforms
We had to package each deliverable in accordance with platform-specific requirements for file naming, manifest structure, and metadata completeness. Even when the XML was technically correct, improper submission packaging could still cause ingestion failure, creating rework and SLA risk that our team wanted to avoid.
Our workflow integrated normalization, tagging, element handling, validation, and packaging into a single continuous pipeline. The goal was to progressively reduce complexity at each stage—so that files arriving for validation were already structurally sound, and those arriving for delivery were guaranteed to load and integrate without errors.
Incoming manuscripts in five distinct formats (Word, InDesign exports, PDF, LaTeX, and legacy XML) were brought to a consistent structural baseline before any XML tagging began. At the normalization stage, we ran automated scans to detect missing figures, encoding anomalies, malformed tables, and font-dependent symbols.
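As an illustration only, a normalization scan of this kind might be sketched in Python as follows. This is not the production tooling: it assumes the manuscript has already been reduced to plain text plus a list of referenced figure file names, and the function names `scan_text` and `missing_assets` are hypothetical.

```python
import unicodedata
from pathlib import Path


def scan_text(text: str) -> list[str]:
    """Flag characters that commonly signal encoding damage."""
    issues = []
    for pos, ch in enumerate(text):
        if ch == "\ufffd":
            # U+FFFD appears where a decoder could not interpret the input bytes
            issues.append(f"replacement character at offset {pos}")
        elif unicodedata.category(ch) == "Co":
            # Private Use Area code points usually come from symbol fonts
            issues.append(
                f"private-use (font-dependent) symbol U+{ord(ch):04X} at offset {pos}"
            )
    return issues


def missing_assets(referenced: list[str], asset_dir: Path) -> list[str]:
    """Return referenced figure files that are absent from asset_dir."""
    return [name for name in referenced if not (asset_dir / name).is_file()]
```

A scan like this only surfaces candidates for correction; deciding which symbol a private-use character was meant to be still needs the source font or a human check.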
We mapped articles and book chapters to the JATS/NLM architecture and applied semantic tagging to all structural elements: sections, headings, abstracts, keyword groups, contributor information, affiliations, and cross-references. The outcome was XML that was not only structurally valid but genuinely machine-readable—optimized for indexing, discoverability, and metadata extraction by PubMed and affiliated databases.
We applied MathML encoding to all mathematical expressions and distinguished inline equations from display equations. Tables with merged column headings, row groupings, nested cells, and footnote links were structured through JATS table models to retain their logical relationships. We assembled figure packages with persistent identifiers, captions, alt text, and metadata on usage rights.
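The inline versus display distinction can be illustrated with a short Python sketch. It relies on the JATS elements `inline-formula` and `disp-formula` and the MathML `display` attribute; the helper `wrap_formula` and its parameters are hypothetical and shown only to make the idea concrete.

```python
import xml.etree.ElementTree as ET

MML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("mml", MML_NS)


def wrap_formula(math: ET.Element, display: bool, formula_id: str = "") -> ET.Element:
    """Wrap a MathML <math> element in the matching JATS formula element."""
    wrapper = ET.Element("disp-formula" if display else "inline-formula")
    if formula_id:
        wrapper.set("id", formula_id)
    # Keep the MathML display attribute consistent with the JATS wrapper
    math.set("display", "block" if display else "inline")
    wrapper.append(math)
    return wrapper


# Example: a display equation containing the single identifier x
math = ET.Element(f"{{{MML_NS}}}math")
mi = ET.SubElement(math, f"{{{MML_NS}}}mi")
mi.text = "x"
print(ET.tostring(wrap_formula(math, display=True, formula_id="eq1"), encoding="unicode"))
```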
We converted all bibliographic references into element-level XML with granular tagging for author names, publication details, and source identifiers. Each reference was cross-checked against the DOI, PMID, and PMCID databases during conversion. This ensured that references and identifiers in XML were complete, consistent, and ready to support bibliographic linking across downstream platforms.
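A simplified Python sketch of an identifier check on a tagged reference list is shown below. It assumes identifiers are tagged as JATS `pub-id` elements with `pub-id-type` values of `doi`, `pmid`, or `pmcid`. The regular expressions are illustrative format checks only, and the sketch does not look anything up in an external database, so it cannot confirm that an identifier resolves to the right record.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative format patterns (assumptions); they do not prove an identifier resolves.
PATTERNS = {
    "doi": re.compile(r"^10\.\d{4,9}/\S+$"),
    "pmid": re.compile(r"^\d{1,8}$"),
    "pmcid": re.compile(r"^PMC\d+$"),
}


def check_ref_ids(ref_list_xml: str) -> list[str]:
    """Report references with no identifier or a malformed one."""
    problems = []
    root = ET.fromstring(ref_list_xml)
    for ref in root.iter("ref"):
        ref_id = ref.get("id", "?")
        pub_ids = list(ref.iter("pub-id"))
        if not pub_ids:
            problems.append(f"{ref_id}: no pub-id present")
        for pub_id in pub_ids:
            kind = pub_id.get("pub-id-type", "")
            value = (pub_id.text or "").strip()
            pattern = PATTERNS.get(kind)
            # Identifier types without a pattern are passed through unchecked
            if pattern and not pattern.match(value):
                problems.append(f"{ref_id}: malformed {kind} '{value}'")
    return problems
```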
We validated XML files against the applicable DTD, XSD, or RNG schema using Oxygen XML Editor and Altova XML Editor, then checked them against PMC-specific Schematron rules via PMC Style Checker and custom QA rulesets. This multi-layer validation and compliance-checking process verified the presence of mandatory sections, identifier integrity, and cross-reference accuracy, confirming PMC standards compliance before any file left the production environment.
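To make the business-rule layer concrete, here is a small stand-in written in plain Python rather than Schematron. It checks well-formedness, a few mandatory elements, and whether every `xref` target exists. The `REQUIRED_PATHS` list is a hypothetical rule set and does not reproduce the PMC Style Checker rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical mandatory elements for an article (assumption, not the PMC rule set)
REQUIRED_PATHS = [
    "front/article-meta/title-group/article-title",
    "front/article-meta/abstract",
    "body",
]


def check_article(xml_text: str) -> list[str]:
    """Return rule violations; an empty list means all checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    errors = [
        f"missing required element: {path}"
        for path in REQUIRED_PATHS
        if root.find(path) is None
    ]
    # Collect every id in the document, then confirm each xref target exists
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    for xref in root.iter("xref"):
        # rid may hold several space-separated targets
        for rid in xref.get("rid", "").split():
            if rid not in ids:
                errors.append(f"xref points to unknown id: {rid}")
    return errors
```

Checks like these complement schema validation rather than replace it: a file can satisfy the DTD and still contain a cross-reference that points nowhere.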
We assembled validated XML outputs into structured packages containing all supporting assets, including multimedia, manifest files, figures, and tables—formatted to the exact naming and structural conventions required by distribution platforms like PubMed Central. Update packages for corrections and errata were handled using the same workflow, ensuring consistency between original submissions and post-publication updates.
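A pre-delivery package check could look something like the Python sketch below. The naming rule, the `manifest.txt` file name, and the one-name-per-line manifest layout are all hypothetical placeholders; each real platform publishes its own packaging conventions, which would replace them.

```python
import re
import zipfile

# Hypothetical naming rule: lowercase journal code, article number, optional asset suffix
NAME_RULE = re.compile(r"^[a-z]+-\d+(-[a-z]\d+)?\.(xml|pdf|tif|jpg|mp4)$")


def check_package(zip_path: str, manifest_name: str = "manifest.txt") -> list[str]:
    """Compare the files in a package against its manifest and the naming rule."""
    with zipfile.ZipFile(zip_path) as pkg:
        names = set(pkg.namelist())
        if manifest_name not in names:
            return [f"missing {manifest_name}"]
        # Assumes one whitespace-free file name per manifest line
        listed = set(pkg.read(manifest_name).decode("utf-8").split())
    payload = names - {manifest_name}
    problems = [f"listed but absent: {n}" for n in sorted(listed - payload)]
    problems += [f"present but unlisted: {n}" for n in sorted(payload - listed)]
    problems += [f"bad file name: {n}" for n in sorted(payload) if not NAME_RULE.match(n)]
    return problems
```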
PMC Style Checker
Pre-submission conformance checking against PubMed Central structural and metadata requirements
Oxygen XML Editor
Schema-based validation (DTD/XSD/RNG) and specialist XML authoring and inspection environment
Altova XML Editor
Secondary cross-tool schema validation to catch edge-case errors before delivery
ISO Schematron
Business rule enforcement for identifier completeness, cross-references, and mandatory section structure
Custom QA Rulesets
Per-journal and per-imprint publisher-specific validation logic applied at the Quality Assurance stage
At volumes of 10,000–15,000 pages per month, quality assurance could not rely solely on manual inspection—but given the technical complexity of the content, automation alone was also insufficient. The QC framework combined automated validation with human review, applied at the specific points in the workflow where automated tools reached their operational limits.
All XML files were checked against their applicable DTD, XSD, or RNG schema and passed through Schematron rules specific to PMC, which governed structural logic, identifier completeness, and cross-reference integrity. Automated XML quality assurance at this layer—running across every file processed—provided systematic, scalable coverage that manual review could not replicate at monthly production volumes.
We routed complex content, such as intricate table structures, mathematical markup, multilingual abstracts, and chemical formulas, to domain-trained specialists for review. These reviewers assessed semantic accuracy and JATS-compliant representation, catching errors that passed schema validation but misrepresented the underlying content.
Sampling depth was calibrated to content risk: files with dense equations, non-standard table configurations, or multilingual content received more intensive editorial review than lower-complexity submissions. This risk-based approach concentrated human review effort where it had the highest impact, without creating bottlenecks across the full production volume.
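One way to picture risk-based calibration is a simple scoring function, sketched here in Python. The weights, thresholds, and review rates are invented for illustration and are not the values used in the engagement.

```python
from dataclasses import dataclass


@dataclass
class FileProfile:
    equations: int        # number of equations in the file
    complex_tables: int   # tables with spans, nesting, or footnote links
    multilingual: bool    # True if the file has abstracts in more than one language


def review_rate(profile: FileProfile) -> float:
    """Map content complexity to the share of a file sent for human review."""
    # Illustrative weights (assumptions)
    score = (
        2 * profile.equations
        + 3 * profile.complex_tables
        + (10 if profile.multilingual else 0)
    )
    if score >= 30:
        return 1.0   # full editorial review
    if score >= 10:
        return 0.5   # review half of the content
    return 0.1       # light spot check
```

For instance, under these made-up weights a file with ten equations and a multilingual abstract scores 30 and is reviewed in full, while a plain-text case report receives only a spot check.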
Every processed file carried a digital audit trail documenting the defects identified and resolved. We categorized defects and linked them to specific KPIs, including turnaround time, first-pass yield, and rework rate, and fed the results into a continuous Corrective and Preventive Action (CAPA) cycle. This traceability created an evidence base for ongoing quality improvement across the engagement.
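As a rough illustration of how audit records can roll up into KPIs, the Python sketch below computes first-pass yield, rework rate, and defect counts per category. The record layout (a `defects` list per file) is an assumption, as is the simplification that any file with at least one defect counts as reworked.

```python
from collections import Counter


def kpi_summary(records: list[dict]) -> dict:
    """Summarize per-file QA records into first-pass yield, rework rate, and defect counts."""
    total = len(records)
    if total == 0:
        return {"first_pass_yield": 0.0, "rework_rate": 0.0, "defects_by_category": {}}
    # A file passes first time when its audit record lists no defects
    clean = sum(1 for record in records if not record["defects"])
    defects = Counter(d for record in records for d in record["defects"])
    return {
        "first_pass_yield": clean / total,
        "rework_rate": (total - clean) / total,
        "defects_by_category": dict(defects),
    }
```

Counts by category are what make the CAPA loop actionable: a rising count in one category points to a specific stage of the pipeline to investigate.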
By building quality into every stage of the XML production workflow rather than treating it as a final check, the engagement delivered results that held steady throughout the partnership.
10K-15K Pages Processed Every Month
Sustained across all active journal titles and book imprints simultaneously, including during periods of elevated submission volume across multiple concurrent publishing cycles.
99% Timely Delivery across the Engagement
Consistent turnaround across every monthly cycle, sustained through a production architecture designed for volume rather than adapted to it after the fact.
90%+ First-Pass PMC Acceptance Rate
More than 90% of XML files cleared PubMed Central ingestion on initial submission, avoiding correction cycles for those files and reducing the production overhead carried by the client's in-house teams.
Our team delivers standards-compliant PubMed XML conversion services at the volumes scholarly publishers need, with quality assurance they can rely on. Alongside PubMed XML production, we support businesses with a range of XML transformation services (PRISM, TEI XML, DTBook, DocBook), ePub3 conversion services, and content distribution (via our digital publishing solution, OneRead). Write to us at info@suntecdigital.com to discuss your project requirements.