LibXmlParser.pas
上传用户:ctlcnc
上传日期:2021-12-10
资源大小:4933k
文件大小:101k
- (**
- ===============================================================================================
- Name : LibXmlParser
- ===============================================================================================
- Project : All Projects
- ===============================================================================================
- Subject : Progressive XML Parser for all types of XML Files
- ===============================================================================================
- Author : Stefan Heymann
- Eschenweg 3
- 72076 T黚ingen
- GERMANY
- E-Mail: stefan@destructor.de
- URL: www.destructor.de
- ===============================================================================================
- Source, Legals ("Licence")
- --------------------------
- The official site to get this parser is http://www.destructor.de/
- Usage and Distribution of this Source Code is ruled by the
- "Destructor.de Source code Licence" (DSL) which comes with this file or
- can be downloaded at http://www.destructor.de/
- IN SHORT: Usage and distribution of this source code is free.
- You use it completely on your own risk.
- Postcardware
- ------------
- If you like this code, please send a postcard of your city to my above address.
- ===============================================================================================
- !!! All parts of this code which are not finished or not conforming exactly to
- the XmlSpec are marked with three exclamation marks
- -!- Parts where the parser may be able to detect errors in the document's syntax are
- marked with the dash-exlamation mark-dash sequence.
- ===============================================================================================
- Terminology:
- ------------
- - Start: Start of a buffer part
- - Final: End (last character) of a buffer part
- - DTD: Document Type Definition
- - DTDc: Document Type Declaration
- - XMLSpec: The current W3C XML Recommendation (version 1.0 as of 1998-02-10), Chapter No.
- - Cur*: Fields concerning the "Current" part passed back by the "Scan" method
- ===============================================================================================
- Scanning the XML document
- -------------------------
- - Create TXmlParser Instance MyXml := TXmlParser.Create;
- - Load XML Document MyXml.LoadFromFile (Filename);
- - Start Scanning MyXml.StartScan;
- - Scan Loop WHILE MyXml.Scan DO
- - Test for Part Type CASE MyXml.CurPartType OF
- - Handle Parts ... : ;;;
- - Handle Parts ... : ;;;
- - Handle Parts ... : ;;;
- END;
- - Destroy MyXml.Free;
- ===============================================================================================
- Loading the XML document
- ------------------------
- You can load the XML document from a file with the "LoadFromFile" method.
- It is beyond the scope of this parser to perform HTTP or FTP accesses. If you want your
- application to handle such requests (URLs), you can load the XML via HTTP or FTP or whatever
- protocol and hand over the data buffer using the "LoadFromBuffer" or "SetBuffer" method.
- "LoadFromBuffer" loads the internal buffer of TXmlParser with the given null-terminated
- string, thereby creating a copy of that buffer.
- "SetBuffer" just takes the pointer to another buffer, which means that the given
- buffer pointer must be valid while the document is accessed via TXmlParser.
- ===============================================================================================
- Encodings:
- ----------
- This XML parser kind of "understands" the following encodings:
- - UTF-8
- - ISO-8859-1
- - Windows-1252
- Any flavor of multi-byte characters (and this includes UTF-16) is not supported. Sorry.
- Every string which has to be passed to the application passes the virtual method
- "TranslateEncoding" which translates the string from the current encoding (stored in
- "CurEncoding") into the encoding the application wishes to receive.
- The "TranslateEncoding" method that is built into TXmlParser assumes that the application
- wants to receive Windows ANSI (Windows-1252, about the same as ISO-8859-1) and is able
- to convert UTF-8 and ISO-8859-1 encodings.
- For other source and target encodings, you will have to override "TranslateEncoding".
- ===============================================================================================
- Buffer Handling
- ---------------
- - The document must be loaded completely into a piece of RAM
- - All character positions are referenced by PChar pointers
- - The TXmlParser instance can either "own" the buffer itself (then, FBufferSize is > 0)
- or reference the buffer of another instance or object (then, FBuffersize is 0 and
- FBuffer is not NIL)
- - The Property DocBuffer passes back a pointer to the first byte of the document. If there
- is no document stored (FBuffer is NIL), the DocBuffer returns a pointer to a NULL character.
- ===============================================================================================
- Whitespace Handling
- -------------------
- The TXmlParser property "PackSpaces" determines how Whitespace is returned in Text Content:
- While PackSpaces is true, all leading and trailing whitespace characters are trimmed of, all
- Whitespace is converted to Space #x20 characters and contiguous Whitespace characters are
- compressed to one.
- If the "Scan" method reports a ptContent part, the application can get the original text
- with all whitespace characters by extracting the characters from "CurStart" to "CurFinal".
- If the application detects an xml:space attribute, it can set "PackSpaces" accordingly or
- use CurStart/CurFinal.
- Please note that TXmlParser does _not_ normalize Line Breaks to single LineFeed characters
- as the XmlSpec requires (XmlSpec 2.11).
- The xml:space attribute is not handled by TXmlParser. This is on behalf of the application.
- ===============================================================================================
- Non-XML-Conforming
- ------------------
- TXmlParser does not conform 100 % exactly to the XmlSpec:
- - UTF-16 is not supported (XmlSpec 2.2)
- (Workaround: Convert UTF-16 to UTF-8 and hand the buffer over to TXmlParser)
- - As the parser only works with single byte strings, all Unicode characters > 255
- can currently not be handled correctly.
- - Line breaks are not normalized to single Linefeed #x0A characters (XmlSpec 2.11)
- (Workaround: The Application can access the text contents on its own [CurStart, CurFinal],
- thereby applying every normalization it wishes to)
- - The attribute value normalization does not work exactly as defined in the
- Second Edition of the XML 1.0 specification.
- - See also the code parts marked with three consecutive exclamation marks. These are
- parts which are not finished in the current code release.
- This list may be incomplete, so it may grow if I get to know any other points.
- As work on the parser proceeds, this list may also shrink.
- ===============================================================================================
- Things Todo
- -----------
- - Introduce a new event/callback which is called when there is an unresolvable
- entity or character reference
- - Support Unicode
- - Use Streams instead of reading the whole XML into memory
- ===============================================================================================
- Change History, Version numbers
- -------------------------------
- The Date is given in ISO Year-Month-Day (YYYY-MM-DD) order.
- Versions are counted from 1.0.0 beginning with the version from 2000-03-16.
- Unreleased versions don't get a version number.
- Date Author Version Changes
- -----------------------------------------------------------------------------------------------
- 2000-03-16 HeySt 1.0.0 Start
- 2000-03-28 HeySt 1.0.1 Initial Publishing of TXmlParser on the destructor.de Web Site
- 2000-03-30 HeySt 1.0.2 TXmlParser.AnalyzeCData: Call "TranslateEncoding" for CurContent
- 2000-03-31 HeySt 1.0.3 Deleted the StrPosE function (was not needed anyway)
- 2000-04-04 HeySt 1.0.4 TDtdElementRec modified: Start/Final for all Elements;
- Should be backwards compatible.
- AnalyzeDtdc: Set CurPartType to ptDtdc
- 2000-04-23 HeySt 1.0.5 New class TObjectList. Eliminated reference to the Delphi 5
- "Contnrs" unit so LibXmlParser is Delphi 4 compatible.
- 2000-07-03 HeySt 1.0.6 TNvpNode: Added Constructor
- 2000-07-11 HeySt 1.0.7 Removed "Windows" from USES clause
- Added three-exclamation-mark comments for Utf8ToAnsi/AnsiToUtf8
- Added three-exclamation-mark comments for CHR function calls
- 2000-07-23 HeySt 1.0.8 TXmlParser.Clear: CurAttr.Clear; EntityStack.Clear;
- (This was not a bug; just defensive programming)
- 2000-07-29 HeySt 1.0.9 TNvpList: Added methods: Node(Index), Value(Index), Name(Index);
- 2000-10-07 HeySt Introduced Conditional Defines
- Uses Contnrs unit and its TObjectList class again for
- Delphi 5 and newer versions
- 2001-01-30 HeySt Introduced Version Numbering
- Made LoadFromFile and LoadFromBuffer BOOLEAN functions
- Introduced FileMode parameter for LoadFromFile
- BugFix: TAttrList.Analyze: Must add CWhitespace to ExtractName call
- Comments worked over
- 2001-02-28 HeySt 1.0.10 Completely worked over and tested the UTF-8 functions
- Fixed a bug in TXmlParser.Scan which caused it to start over when it
- was called after the end of scanning, resulting in an endless loop
- TEntityStack is now a TObjectList instead of TList
- 2001-07-03 HeySt 1.0.11 Updated Compiler Version IFDEFs for Kylix
- 2001-07-11 HeySt 1.0.12 New TCustomXmlScanner component (taken over from LibXmlComps.pas)
- 2001-07-14 HeySt 1.0.13 Bugfix TCustomXmlScanner.FOnTranslateEncoding
- 2001-10-22 HeySt Don't clear CurName anymore when the parser finds a CDATA section.
- 2001-12-03 HeySt 1.0.14 TObjectList.Clear: Make call to INHERITED method (fixes a memory leak)
- 2001-12-05 HeySt 1.0.15 TObjectList.Clear: removed call to INHERITED method
- TObjectList.Destroy: Inserted SetCapacity call.
- Reduces need for frequent re-allocation of pointer buffer
- Dedicated to my father, Theodor Heymann
- 2002-06-26 HeySt 1.0.16 TXmlParser.Scan: Fixed a bug with PIs whose name is beginning
- with 'xml'. Thanks to Uwe Kamm for submitting this bug.
- The CurEncoding property is now always in uppercase letters (the XML
- spec wants it to be treated case independently so when it's uppercase
- comparisons are faster)
- 2002-03-04 HeySt 1.0.17 Included an IFDEF for Delphi 7 (VER150) and Kylix
- There is a new symbol HAS_CONTNRS_UNIT which is used now to
- distinguish between IDEs which come with the Contnrs unit and
- those that don't.
- *)
- // --- Delphi/Kylix Version Numbers
- // As this is no code, this does not blow up your object or executable code at all
- (*$IFDEF LINUX *)
- (*$DEFINE K1_OR_NEWER *)
- (*$ENDIF *)
- (*$IFDEF MSWINDOWS *)
- (*$DEFINE D1_OR_NEWER *)
- (*$IFNDEF VER80 *)
- (*$DEFINE D2_OR_NEWER *)
- (*$IFNDEF VER90 *)
- (*$DEFINE D3_OR_NEWER *)
- (*$IFNDEF VER100 *)
- (*$DEFINE D4_OR_NEWER *)
- (*$IFNDEF VER120 *)
- (*$DEFINE D5_OR_NEWER *)
- (*$IFNDEF VER130 *)
- (*$IFNDEF VER140 *)
- (*$IFNDEF VER150 *)
- { If the compiler gets stuck here,
- you are using a compiler version unknown to this code.
- You will probably have to change this code accordingly.
- At first, try to comment out these lines and see what will happen.}
- (*$ENDIF *)
- (*$ENDIF *)
- (*$ENDIF *)
- (*$ENDIF *)
- (*$ENDIF *)
- (*$ENDIF *)
- (*$ENDIF *)
- (*$ENDIF *)
- (*$IFDEF D5_OR_NEWER *)
- (*$DEFINE HAS_CONTNRS_UNIT *)
- (*$ENDIF *)
- (*$IFDEF K1_OR_NEWER *)
- (*$DEFINE HAS_CONTNRS_UNIT *)
- (*$ENDIF *)
- unit LibXmlParser;
- interface
- uses
- SysUtils, Classes,
- (*$IFDEF HAS_CONTNRS_UNIT *)// The Contnrs Unit was introduced in Delphi 5
- Contnrs,
- (*$ENDIF*)
- Math;
- const
- CVersion = '1.0.17'; // This variable will be updated for every release
- // (I hope, I won't forget to do it everytime ...)
- type
- TPartType = // --- Document Part Types
- (ptNone, // Nothing
- ptXmlProlog, // XML Prolog XmlSpec 2.8 / 4.3.1
- ptComment, // Comment XmlSpec 2.5
- ptPI, // Processing Instruction XmlSpec 2.6
- ptDtdc, // Document Type Declaration XmlSpec 2.8
- ptStartTag, // Start Tag XmlSpec 3.1
- ptEmptyTag, // Empty-Element Tag XmlSpec 3.1
- ptEndTag, // End Tag XmlSpec 3.1
- ptContent, // Text Content between Tags
- ptCData); // CDATA Section XmlSpec 2.7
- TDtdElemType = // --- DTD Elements
- (deElement, // !ELEMENT declaration
- deAttList, // !ATTLIST declaration
- deEntity, // !ENTITY declaration
- deNotation, // !NOTATION declaration
- dePI, // PI in DTD
- deComment, // Comment in DTD
- deError); // Error found in the DTD
- type
- TAttrList = class;
- TEntityStack = class;
- TNvpList = class;
- TElemDef = class;
- TElemList = class;
- TEntityDef = class;
- TNotationDef = class;
- TDtdElementRec = record
- // --- This Record is returned by the DTD parser callback function
- Start, Final: PChar;
- // Start/End of the Element's Declaration
- case ElementType: TDtdElemType of
- // Type of the Element
- deElement, // <!ELEMENT>
- deAttList: (ElemDef: TElemDef); // <!ATTLIST>
- deEntity: (EntityDef: TEntityDef); // <!ENTITY>
- deNotation: (NotationDef: TNotationDef); // <!NOTATION>
- dePI: (Target: PChar; // <?PI ?>
- Content: PChar;
- AttrList: TAttrList);
- deError: (Pos: PChar); // Error
- // deComment : ((No additional fields here)); // <!-- Comment -->
- end;
- TXmlParser = class // --- Internal Properties and Methods
- protected
- FBuffer: PChar; // NIL if there is no buffer available
- FBufferSize: integer;
- // 0 if the buffer is not owned by the Document instance
- FSource: string;
- // Name of Source of document. Filename for Documents loaded with LoadFromFile
- FXmlVersion: string;
- // XML version from Document header. Default is '1.0'
- FEncoding: string;
- // Encoding from Document header. Default is 'UTF-8'
- FStandalone: boolean;
- // Standalone declaration from Document header. Default is 'yes'
- FRootName: string; // Name of the Root Element (= DTD name)
- FDtdcFinal: PChar;
- // Pointer to the '>' character terminating the DTD declaration
- FNormalize: boolean;
- // If true: Pack Whitespace and don't return empty contents
- EntityStack: TEntityStack;
- // Entity Stack for Parameter and General Entities
- FCurEncoding: string;
- // Current Encoding during parsing (always uppercase)
- procedure AnalyzeProlog;
- // Analyze XML Prolog or Text Declaration
- procedure AnalyzeComment(Start: PChar; var Final: PChar);
- // Analyze Comments
- procedure AnalyzePI(Start: PChar; var Final: PChar);
- // Analyze Processing Instructions (PI)
- procedure AnalyzeDtdc;
- // Analyze Document Type Declaration
- procedure AnalyzeDtdElements(Start: PChar; var Final: PChar);
- // Analyze DTD declarations
- procedure AnalyzeTag;
- // Analyze Start/End/Empty-Element Tags
- procedure AnalyzeCData;
- // Analyze CDATA Sections
- procedure AnalyzeText(var IsDone: boolean);
- // Analyze Text Content between Tags
- procedure AnalyzeElementDecl(Start: PChar; var Final: PChar);
- procedure AnalyzeAttListDecl(Start: PChar; var Final: PChar);
- procedure AnalyzeEntityDecl(Start: PChar; var Final: PChar);
- procedure AnalyzeNotationDecl(Start: PChar; var Final: PChar);
- procedure PushPE(var Start: PChar);
- procedure ReplaceCharacterEntities(var Str: string);
- procedure ReplaceParameterEntities(var Str: string);
- procedure ReplaceGeneralEntities(var Str: string);
- function GetDocBuffer: PChar;
- // Returns FBuffer or a pointer to a NUL char if Buffer is empty
- public // --- Document Properties
- property XmlVersion: string Read FXmlVersion;
- // XML version from the Document Prolog
- property Encoding: string Read FEncoding;
- // Document Encoding from Prolog
- property Standalone: boolean Read FStandalone;
- // Standalone Declaration from Prolog
- property RootName: string Read FRootName;
- // Name of the Root Element
- property Normalize: boolean Read FNormalize Write FNormalize;
- // True if Content is to be normalized
- property Source: string Read FSource;
- // Name of Document Source (Filename)
- property DocBuffer: PChar Read GetDocBuffer;
- // Returns document buffer
- public // --- DTD Objects
- Elements: TElemList;
- // Elements: List of TElemDef (contains Attribute Definitions)
- Entities: TNvpList; // General Entities: List of TEntityDef
- ParEntities: TNvpList; // Parameter Entities: List of TEntityDef
- Notations: TNvpList; // Notations: List of TNotationDef
- public
- constructor Create;
- destructor Destroy; override;
- // --- Document Handling
- function LoadFromFile(Filename: string;
- FileMode: integer =
- fmOpenRead or fmShareDenyNone): boolean;
- // Loads Document from given file
- function LoadFromBuffer(Buffer: PChar): boolean;
- // Loads Document from another buffer
- procedure SetBuffer(Buffer: PChar);
- // References another buffer
- procedure Clear;
- // Clear Document
- public
- // --- Scanning through the document
- CurPartType: TPartType; // Current Type
- CurName: string; // Current Name
- CurContent: string;
- // Current Normalized Content
- CurStart: PChar;
- // Current First character
- CurFinal: PChar;
- // Current Last character
- CurAttr: TAttrList;
- // Current Attribute List
- property CurEncoding: string Read FCurEncoding; // Current Encoding
- procedure StartScan;
- function Scan: boolean;
- // --- Events / Callbacks
- function LoadExternalEntity(SystemId, PublicId,
- Notation: string): TXmlParser;
- virtual;
- function TranslateEncoding(const Source: string): string;
- virtual;
- procedure DtdElementFound(DtdElementRec: TDtdElementRec);
- virtual;
- end;
- TValueType = // --- Attribute Value Type
- (vtNormal, // Normal specified Attribute
- vtImplied, // #IMPLIED attribute value
- vtFixed, // #FIXED attribute value
- vtDefault);
- // Attribute value from default value in !ATTLIST declaration
- TAttrDefault = // --- Attribute Default Type
- (adDefault, // Normal default value
- adRequired, // #REQUIRED attribute
- adImplied, // #IMPLIED attribute
- adFixed); // #FIXED attribute
- TAttrType = // --- Type of attribute
- (atUnknown, // Unknown type
- atCData, // Character data only
- atID, // ID
- atIdRef, // ID Reference
- atIdRefs, // Several ID References, separated by Whitespace
- atEntity, // Name of an unparsed Entity
- atEntities,
- // Several unparsed Entity names, separated by Whitespace
- atNmToken, // Name Token
- atNmTokens, // Several Name Tokens, separated by Whitespace
- atNotation, // A selection of Notation names (Unparsed Entity)
- atEnumeration); // Enumeration
- TElemType = // --- Element content type
- (etEmpty, // Element is always empty
- etAny,
- // Element can have any mixture of PCDATA and any elements
- etChildren, // Element must contain only elements
- etMixed); // Mixed PCDATA and elements
- (*$IFDEF HAS_CONTNRS_UNIT *)
- TObjectList = Contnrs.TObjectList; // Re-Export this identifier
- (*$ELSE *)
- TObjectList = class(TList)
- destructor Destroy; override;
- procedure Delete(Index: integer);
- procedure Clear; override;
- end;
- (*$ENDIF *)
- TNvpNode = class // Name-Value Pair Node
- Name: string;
- Value: string;
- constructor Create(TheName: string = ''; TheValue: string = '');
- end;
- TNvpList = class(TObjectList) // Name-Value Pair List
- procedure Add(Node: TNvpNode);
- function Node(Name: string): TNvpNode; overload;
- function Node(Index: integer): TNvpNode; overload;
- function Value(Name: string): string; overload;
- function Value(Index: integer): string; overload;
- function Name(Index: integer): string;
- end;
- TAttr = class(TNvpNode) // Attribute of a Start-Tag or Empty-Element-Tag
- ValueType: TValueType;
- AttrType: TAttrType;
- end;
- TAttrList = class(TNvpList) // List of Attributes
- procedure Analyze(Start: PChar; var Final: PChar);
- end;
- TEntityStack = class(TObjectList)
- // Stack where current position is stored before parsing entities
- protected
- Owner: TXmlParser;
- public
- constructor Create(TheOwner: TXmlParser);
- procedure Push(LastPos: PChar); overload;
- procedure Push(Instance: TObject; LastPos: PChar); overload;
- function Pop: PChar;
- // Returns next char or NIL if EOF is reached. Frees Instance.
- end;
- TAttrDef = class(TNvpNode)
- // Represents a <!ATTLIST Definition. "Value" is the default value
- TypeDef: string; // Type definition from the DTD
- Notations: string;
- // Notation List, separated by pipe symbols '|'
- AttrType: TAttrType; // Attribute Type
- DefaultType: TAttrDefault; // Default Type
- end;
- TElemDef = class(TNvpList)
- // Represents a <!ELEMENT Definition. Is a list of TAttrDef-Nodes
- Name: string; // Element name
- ElemType: TElemType; // Element type
- Definition: string; // Element definition from DTD
- end;
- TElemList = class(TObjectList) // List of TElemDef nodes
- function Node(Name: string): TElemDef;
- procedure Add(Node: TElemDef);
- end;
- TEntityDef = class(TNvpNode) // Represents a <!ENTITY Definition.
- SystemId: string;
- PublicId: string;
- NotationName: string;
- end;
- TNotationDef = class(TNvpNode)
- // Represents a <!NOTATION Definition. Value is the System ID
- PublicId: string;
- end;
- TCharset = set of char;
- const
- CWhitespace = [#32, #9, #13, #10];
- // Whitespace characters (XmlSpec 2.3)
- CLetter = [#$41..#$5A, #$61..#$7A, #$C0..#$D6, #$D8..#$F6, #$F8..#$FF];
- CDigit = [#$30..#$39];
- CNameChar = CLetter + CDigit + ['.', '-', '_', ':', #$B7];
- CNameStart = CLetter + ['_', ':'];
- CQuoteChar = ['"', ''''];
- CPubidChar = [#32, ^M, ^J, #9, 'a'..'z', 'A'..'Z', '0'..'9',
- '-', '''', '(', ')', '+', ',', '.', '/', ':',
- '=', '?', ';', '!', '*', '#', '@', '$', '_', '%'];
- CDStart = '<![CDATA[';
- CDEnd = ']]>';
- // --- Name Constants for the above enumeration types
- CPartType_Name: array [TPartType] of string =
- ('', 'XML Prolog', 'Comment', 'PI',
- 'DTD Declaration', 'Start Tag', 'Empty Tag', 'End Tag',
- 'Text', 'CDATA');
- CValueType_Name: array [TValueType] of string =
- ('Normal', 'Implied', 'Fixed', 'Default');
- CAttrDefault_Name: array [TAttrDefault] of string =
- ('Default', 'Required', 'Implied', 'Fixed');
- CElemType_Name: array [TElemType] of string =
- ('Empty', 'Any', 'Childs only', 'Mixed');
- CAttrType_Name: array [TAttrType] of string = ('Unknown', 'CDATA',
- 'ID', 'IDREF', 'IDREFS',
- 'ENTITY', 'ENTITIES',
- 'NMTOKEN', 'NMTOKENS',
- 'Notation', 'Enumeration');
- function ConvertWs(Source: string; PackWs: boolean): string;
- // Convert WS to spaces #x20
- procedure SetStringSF(var S: string; BufferStart, BufferFinal: PChar);
- // SetString by Start/Final of buffer
- function StrSFPas(Start, Finish: PChar): string;
- // Convert buffer part to Pascal string
- function TrimWs(Source: string): string;
- // Trim Whitespace
- function AnsiToUtf8(Source: ansistring): string;
- // Convert Win-1252 to UTF-8
- function Utf8ToAnsi(Source: string; UnknownChar: char = '