 Last modified:
        Monday, 20-May-2019 01:31:03 UTC. Maintained by:  Elisa E. Beshero-Bondar
        (eeb4 at psu.edu). Powered by firebellies.
What is XML, anyway?
XML is short for eXtensible Markup Language, and it’s a standard system for storing and accessing information that is used practically everywhere around the world. It’s the informational markup (or code) that makes Microsoft Office and Blackboard software run, and it’s the foundation of many online network applications. For our purposes as researchers, it’s an excellent method for storing information and for preparing to share it with the public. XML is independent of proprietary software applications, which means that what you write in XML is freely exchangeable between computers of different kinds (across platforms, as in Macs and PCs). It outlasts software obsolescence, because it’s a standard that can be read universally.
You’ve probably heard of HTML (hypertext markup language), which is the code that makes web pages presentable in web browsers. That’s a kind of Markup Language, too, designed specifically and only for presentation and publication on the world-wide web: HTML is for presentation and display, but XML is primarily for storage of information, and we can call it informational markup, as opposed to presentational markup. We can write code to take information written in XML and transform it for presentation online—and you’ll gain experience with doing that as we proceed with our class this semester.
XML is great for researchers in the Humanities and Social Sciences because it’s very effective at storing and cataloging information systematically. You can write XML to set up hierarchies (or nested structures) of information, and also to locate and extract that information later when you need it. So, if we were going to store a book in XML, we’d pay attention to the way it’s structured, maybe with chapters. Inside those chapters we’d have chapter titles and paragraphs, and inside those, sentences, and then individual words and punctuation. If we wanted to, we could systematically mark all the action verbs in those sentences, and all the exclamation points, if this was important. We could design a hierarchical system using XML to capture and hold the information we want to collect.
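To make the idea of nesting concrete, here is a minimal sketch of how such a book might be coded. Every element and attribute name below (book, chapter, title, paragraph, sentence, verb, punct, n, type) is invented for illustration, as is the sample text; a real project would choose its own names.

```xml
<book>
   <chapter n="1">
      <title>The Storm</title>
      <paragraph>
         <sentence>The ship <verb type="action">lurched</verb> to one side<punct type="exclamation">!</punct></sentence>
      </paragraph>
   </chapter>
</book>
```

Each layer sits entirely inside the one above it, which is what lets us later ask for, say, every action verb in chapter 1.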
When we do research in the humanities, we’re working with documents written by human beings, and XML is useful for preserving them for reading and studying, and for extracting information from them later. We can do this close-up (through "close reading"), by reading with our eyes, one by one. Or we can code documents systematically (which also involves close reading), in order to step BACK and view them from a distance: to let a computer discover patterns we couldn’t so easily see on our own. In Digital Humanities, this practice of working with computers to make them show us patterns across enormous, complicated texts or many, many texts is called distant reading. XML helps us prepare texts for this, for two reasons:
We have to start by studying our documents to see how they’re structured, and identify what matters to us in describing a structure. This practice is called document analysis, and it’s basically what you’re doing when you have to make decisions about how to code a recipe, a voyage log, a poem, or a letter in our first XML exercises in this course.
Tree
In technical terms, we can think of every XML document as a tree, sprouting from a single root, which contains and identifies the whole thing. That outermost layer is the start-tag and end-tag, the alpha and the omega of an XML file. I tend to think of this as a single box that contains everything else, with all its branching complexity inside.
XML marks a structure, or the hierarchy of a document, by using elements, such as shopping_list and food_item. Each element consists of three parts: a start tag, the content, and an end tag.

A start tag is defined with angle brackets, and an end tag looks like a start tag, except it has a forward slash after the opening angle bracket. When we refer to tags, we’re talking about those start and end tags. When we talk about an element, we’re referring to the whole thing: start-tag, CONTENT, and end-tag. Make sense?
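As a small sketch, here is what the shopping_list and food_item elements mentioned above could look like when written out. The element names come from the discussion above; the list items themselves are made up for illustration.

```xml
<shopping_list>
   <food_item>apples</food_item>
   <food_item>bread</food_item>
</shopping_list>
```

Here shopping_list is the root element, and each food_item element is made up of a start tag, its text content, and an end tag with the forward slash.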
Elements may also include something called attributes—an additional markup that gives supplementary information about an element. So, say we had ingredient names in French and Spanish in our shopping list, and we wanted to mark those: One option for doing this would be, say, like this:
         <foreign language="French">escargot</foreign>
            <foreign language="Spanish">sofrito</foreign>
      
See how this works: Attributes are written inside the start tag of an element (but NOT inside the end tag). They consist of an attribute name and an attribute value. The attribute name, here, is language, and the attribute value is (you guessed it!) French or Spanish. (Attributes are sort of like adjectives, or descriptive modifiers!) Notice there’s a rule for HOW to write attributes: attribute values must be enclosed inside quotation marks. These can be either double straight quotation marks (") or single straight apostrophes ('). Either one works, but try to use them consistently. Later on, when we’re writing other kinds of code that reads and extracts from your XML, you’ll find that you need to work with single quotes to refer to attribute values; more on that later. For now, as you write XML, double quotes are what we commonly use. Note that these are straight quotation marks (") and not the curly ones that you see in a word processor.
In special cases, XML elements can actually have no text content at all! These are called self-closing elements and they have a special syntax so that they open and close inside a single tag. Here is an example:
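As a minimal sketch of the syntax, a self-closing element puts the forward slash just before the closing angle bracket of a single tag. The element name pageBreak and its attribute n are placeholders chosen only for illustration:

```xml
<pageBreak n="12"/>
```

To a parser, this means the same thing as a start tag followed immediately by its end tag, with nothing in between.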
Here is one use-case for a self-closing tag: We are using it to contain information about where a line of poetry ends, because our XML markup would not make that clear. The line numbering is not literally in the poem we are coding, but we want to record the information about the line ending in the appropriate place:
            <poem> 
               I think that I shall never see<lb n="1"/>
               A poem lovely as a tree<lb n="2"/>
            </poem>
         
      
      This shows us a use of markup that does not simply wrap around text, but stores information that will be useful to us later in processing the file. Note that we could also have chosen to code the lines like this:
            <poem> 
               <line n="1">I think that I shall never see</line>
               <line n="2">A poem lovely as a tree</line>
            </poem>
         
         
      Both ways of coding the lines of poetry are correct, and might be used for different reasons. If we didn’t hold the information about the lines in some way, whether wrapping each line, or using self-closing tags at the end of the line, the computer parser would simply see an uninterrupted single line of text, with no notion of the meaningful nature of line breaks.
Even though you have spaced this out with hard returns in your oXygen XML editor, to a computer parser, the text itself is a single undifferentiated string, because in XML hard returns and extra spaces appear as meaningless space and are not meant to be treated as stable formatting.
Usually we decide to write self-closing tags when we want to note simple pieces of information and where wrapping the text in open and close tags would actually cause a problem in nesting our elements. 
We will be discussing cases where we might want to use self-closing elements as we proceed in the course. Often they have to do with preserving well-formedness, which we discuss in the next section.
Well Formed and Well Formedness in XML
XML must be well formed in order to be parsed by a computer. That means it must follow the syntax rules for writing XML: It must have a single root element, and its elements must be nested, without any overlap. Also, where attributes are used, these need to be signalled according to expected XML syntax (as above). These rules are necessary for the document to be XML. Well-formed XML is, simply, correct XML.
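For comparison, here is a small sketch of XML that does follow these rules. The element names (recipe, title, step, ingredient, emph) and the text are invented for illustration:

```xml
<recipe>
   <title>Toast</title>
   <step n="1">Slice the <ingredient>bread</ingredient>.</step>
   <step n="2">Toast until <emph>golden</emph>.</step>
</recipe>
```

There is one root element (recipe), every start tag has a matching end tag, each inner element closes before the element that contains it, and the attribute values are in straight quotation marks.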
The following example is NOT well-formed XML. Can you tell why not? (There are multiple reasons!)
 <dairy>
               <item>milk</item>
                <item>yogurt</item>
               </dairy>
 <snacks>
               <item>chips</item
               <item>pretzels</item>
 </snacks>
               
            
This is NOT well-formed XML either. Why not?
 <paragraph>He responded emphatically in French:
               <emph><foreign
               language="french">oui</emph></foreign>!</paragraph> 
            
Computers (as well as people) need to be able to read XML and tell tags apart from text, to distinguish elements from their content. So we run into well-formedness problems when we want to represent certain characters, like left and right angle brackets, AS text. What if you want to write, as I’m doing here, about code, and you need to represent tagging AS text? View my page source, look for the example passages, and you’ll see that I’ve used a work-around solution. What we need to do is escape the special characters (or the reserved characters) that indicate to a computer that these are tags. There are three special characters that we must always escape, and we do this by replacing them with character entities, which tell the computer to display these characters as text only. We’ll show you in class how to do this:
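The three characters are not listed at this point in the page, so the sketch below assumes they are the left angle bracket, the right angle bracket, and the ampersand, each of which has a predefined entity in XML. The note element and its text are invented for illustration:

```xml
<note>
   In XML, write &lt;item&gt; to show a tag as text,
   and write &amp; for the ampersand in "salt &amp; pepper".
</note>
```

A parser reads &lt; as a left angle bracket, &gt; as a right angle bracket, and &amp; as an ampersand, and treats all three as ordinary text rather than as markup.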
      
XML is adaptable and flexible for organizing information, because it is up to the person writing it how they want to define their elements and what they want to call them. When people work in XML in communities, though, they’ll work with specific tagging conventions in order to easily connect and communicate with each other. For several of our projects, we’re working with one of those community languages with XML called TEI. TEI is both a community and a language within XML with a standard set of rules, called a schema. If you work together with a group on an XML project, one thing you’ll need to do is define your project’s schema (or work within a pre-existing schema like the TEI’s) so that you’re all coding consistently. When a project’s XML is correct according to its defined schema, we say that the XML is valid, and we run what’s called a validity check to determine this, in which we check the XML against the project’s schema file. You’ll be learning a little later how to write your own schemas for XML using a language called Relax NG, but for now, we’d like you to get used to the actual writing of XML code and to learn some key concepts about it: well-formedness, nesting hierarchy, working around special reserved characters, and validity.
One last thing: In real life, coders write comments to themselves and each other in a special way that sets those comments apart from the code they are writing. As you write XML to share with others (whether for turning in assignments to the instructors of this course or for sharing code with team members or interested people on the web) you can document the decisions you made by writing little messages designed to be ignored by the computer parsing your document, but meant for human readers. Here is how to write a comment in XML:
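A comment opens with a left angle bracket, an exclamation point, and two hyphens, and closes with two hyphens and a right angle bracket. In this sketch the poem and line elements are reused from the earlier example, and the wording of the comment is only an illustration:

```xml
<poem>
   <!-- Line numbers are my own addition; they do not appear in the source text. -->
   <line n="1">I think that I shall never see</line>
</poem>
```

The parser skips everything between the opening and closing delimiters, so the message is there only for human readers.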
The key rule for writing comments is that you cannot use double hyphens (--) inside the comment, because the computer parser will not be able to tell where your comment ends. (Angle brackets are allowed inside a comment, so you can even comment out a stretch of markup.) It is excellent practice in coding to write messages in comments to remind yourself of a decision or alert someone you are working with about a question or a problem, and we strongly encourage such documentation on every piece of code you write.