Introduction to LINQ
>> Aug 25, 2009
What Is LINQ?
LINQ Providers
Query Syntax and Method Syntax
Query Variables
The Structure of Query Expressions
The Standard Query Operators
LINQ to XML
What Is LINQ?
LINQ is a new feature of C# and Visual Basic .NET that integrates into these languages the ability to query data.
In a relational database system, data is organized into nicely normalized tables, and accessed with a very simple but powerful query language: SQL. SQL can work with any set of data in a database because the data is organized into tables, following strict rules.
In a program, as opposed to a database, however, data is stored in class objects or structs that are all vastly different. As a result, there has been no general query language for retrieving data from data structures. The method of retrieving data from objects has always been custom designed as part of the program. With the introduction of LINQ in C# 3.0, however, the ability to query collections of objects has been added to the language. The following are the important high-level characteristics of LINQ:
• LINQ (pronounced link) stands for Language Integrated Query.
• LINQ is an extension of the .NET Framework that allows you to query collections of data in a manner similar to database queries.
• C# 3.0 includes extensions that integrate LINQ into the language, allowing you to query data from databases, collections of program objects, and XML documents.
The following code shows a simple example of using LINQ. In this code, the data source being queried is simply an array of ints. The definition of the query is the statement with the from and select keywords. Although the query is defined in this statement, it is actually performed and used in the foreach statement at the bottom.
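A minimal sketch of such a program is shown here. The array name numbers and its values are assumptions, chosen so that the query selects the values 2 and 5.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    public static void Main()
    {
        int[] numbers = { 2, 12, 5, 15 };     // Data source (assumed values)

        IEnumerable<int> lowNums =            // Define the query; nothing runs yet
            from n in numbers
            where n < 10
            select n;

        foreach (int x in lowNums)            // Enumerating executes the query
            Console.Write("{0}, ", x);
    }
}
```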
This code produces the following output:
2, 5,
LINQ Providers
In the previous example, the data source was simply an array of ints, which is an in-memory object of the program. LINQ, however, can work with many different types of data sources, such as SQL databases, XML documents, and a host of others. For every data source type, however, under the covers there must be a module of code that implements the LINQ queries in terms of that data source type. These code modules are called LINQ providers. The important points about LINQ providers are the following:
• Microsoft provides LINQ providers for a number of common data source types, as shown in Figure 21-1.
• You can use any LINQ-enabled language (C# 3.0 in our case) to query any data source type for which there is a LINQ provider.
• New LINQ providers are constantly being produced by third parties for all sorts of data source types.
Figure 21-1. The architecture of LINQ, the LINQ-enabled languages, and LINQ providers
There are entire books dedicated to LINQ in all its forms and subtleties, but that is clearly beyond the scope of this chapter. Instead, this chapter will introduce you to LINQ and explain how to use it with program objects (LINQ to Objects) and XML (LINQ to XML).
Anonymous Types
Before getting into the details of LINQ’s querying features, I’ll start by covering a feature of C# 3.0 that allows you to create unnamed class types. These are called, not surprisingly, anonymous types.
In Chapter 5 we covered object initializers, which allow you to initialize the fields and properties of a new class instance when using an object-creation expression. Just to remind you, this kind of object-creation expression consists of three components: the keyword new, the class name or constructor, and the object initializer. The object initializer consists of a comma-separated list of member initializers between a set of curly braces.
Creating a variable of an anonymous type uses the same form—but without the class name or constructor. The following line of code shows the object-creation expression form of an anonymous type:
The following code shows an example of creating and using an anonymous type. It creates a variable called student, with an anonymous type that has three string properties and one int property. Notice in the WriteLine statement that the instance’s members are accessed just as if they were members of a named type.
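A minimal sketch of this example follows. The property names LName, FName, Age, and Major are assumptions; only the variable name student and the printed values come from the text.

```csharp
using System;

class Program
{
    public static void Main()
    {
        // General form: new { Member = Expression, Member = Expression, ... }
        var student = new { LName = "Jones", FName = "Mary", Age = 19, Major = "History" };

        // Members are accessed just like members of a named type.
        Console.WriteLine("{0} {1}, Age {2}, Major: {3}",
            student.FName, student.LName, student.Age, student.Major);
    }
}
```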
This code produces the following output:
Mary Jones, Age 19, Major: History
Important things to know about anonymous types are the following:
• Anonymous types can only be used with local variables—not with class members.
• Since an anonymous type does not have a name, you must use the var keyword as the variable type.
When the compiler encounters the object initializer of an anonymous type, it creates a new class type with a name it constructs. For each member initializer, it infers its type and creates a private variable of that type in the new class, and creates a read-only property to access the variable. The property has the same name as the member initializer. Once the anonymous type is constructed, the compiler creates an object of that type.
Besides the assignment form of member initializers, anonymous type object initializers also allow two other forms: simple identifiers and member access expressions. These two forms are called projection initializers. The following variable declaration shows all three forms. The first member initializer is in the assignment form. The second is an identifier, and the third is a member access expression.
var student = new { Age = 19, Major, Other.Name };
For example, the following code uses all three forms. Notice that the variables used by the projection initializers are defined before the declaration of the anonymous type. Major is a local variable, and Name is a static field of class Other.
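One way this example might look is sketched here. The field value "Mary Jones" and the format string are assumptions chosen to match the output below.

```csharp
using System;

class Other
{
    public static string Name = "Mary Jones";   // Static field used by a projection initializer
}

class Program
{
    public static void Main()
    {
        string Major = "History";                // Local variable used by a projection initializer

        // Assignment form, identifier form, member access form
        var student = new { Age = 19, Major, Other.Name };

        Console.WriteLine("{0}, Age {1}, Major: {2}",
            student.Name, student.Age, student.Major);
    }
}
```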
This code produces the following output:
Mary Jones, Age 19, Major: History
The projection initializer form of the object initializer just shown has exactly the same result as the assignment form shown here:
var student = new { Age = 19, Major = Major, Name = Other.Name };
Although your code cannot see the anonymous type, it is visible to object browsers. If the compiler encounters another anonymous type with the same property names, with the same inferred types, and in the same order, it reuses the type and creates a new instance rather than creating a new anonymous type.
Query Syntax and Method Syntax
There are two syntactic forms you can use when writing LINQ queries: query syntax and method syntax.
• Query syntax is a declarative form that looks very much like an SQL statement. Query syntax is written in the form of query expressions.
• Method syntax is an imperative form, which uses standard method invocations. The methods are from a set called the standard query operators, which will be described later in the chapter.
• You can also combine both forms in a single query.
Microsoft recommends using query syntax because it is more readable and states your query intentions more clearly, and is therefore less error-prone. Some operators, however, can only be written using method syntax.
Note: Queries expressed using query syntax are translated by the C# compiler into method invocation form. There is no difference in runtime performance between the two forms.
The following code shows all three query forms. In the method syntax part, you might find that the parameter of the Where method looks a bit odd. It’s a lambda expression, as was described in Chapter 15. I will cover its use in LINQ a bit later in the chapter.
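A minimal sketch of the three forms is shown here. The array contents and variable names are assumptions, chosen so that four values fall below the assumed threshold of 20.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        int[] numbers = { 2, 5, 28, 31, 17, 16, 42 };   // Assumed data

        var numsQuery = from n in numbers               // Query syntax
                        where n < 20
                        select n;

        var numsMethod = numbers.Where(x => x < 20);    // Method syntax

        int numsCount = (from n in numbers              // Combined
                         where n < 20
                         select n).Count();

        foreach (var x in numsQuery)
            Console.Write("{0}, ", x);
        Console.WriteLine();

        foreach (var x in numsMethod)
            Console.Write("{0}, ", x);
        Console.WriteLine();

        Console.WriteLine(numsCount);
    }
}
```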
This code produces the following output:
2, 5, 17, 16,
2, 5, 17, 16,
4
Query Variables
LINQ queries can return two types of results: an enumeration, which lists the items that satisfy the query parameters, or a single value, called a scalar, which is some form of summary of the results that satisfied the query.
For example, the first code statement that follows returns an IEnumerable object, which can be used to enumerate the results of the query. The second statement executes a query and then calls a method (Count) that returns the count of the items returned from the query. We will cover operators that return scalars, such as Count, later in the chapter.
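The two statements might look like the following sketch. The name lowNums comes from the text; the array, the name numsCount, and the threshold are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    public static void Main()
    {
        int[] numbers = { 2, 5, 28 };                   // Assumed data

        IEnumerable<int> lowNums = from n in numbers    // Returns an enumerable
                                   where n < 20
                                   select n;

        int numsCount = (from n in numbers              // Returns a scalar
                         where n < 20
                         select n).Count();

        Console.WriteLine(numsCount);
    }
}
```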
The variable on the left of the equals sign is called the query variable. Although the types of the query variables are given explicitly in the preceding statements, you could also have had the compiler infer the types of the query variables by using the var keyword in place of the type names.
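It’s important to understand the contents of query variables. After executing the preceding code, query variable lowNums does not contain the results of the query. Instead, it contains an object of type IEnumerable<int>, which can perform the query when it is enumerated later in the code. The variable holding the count, by contrast, contains an actual integer value, which can only have been obtained by running the query.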
The differences in the timing of the execution of the queries can be summarized as follows:
• If a query expression returns an enumeration, the query is not executed until the enumeration is processed. If the enumeration is processed multiple times, the query is executed multiple times.
• If the query expression returns a scalar, the query is executed immediately, and the result is stored in the query variable.
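The timing difference can be observed with a small experiment, sketched here under assumed data: the source array is modified between two enumerations of the same query variable, and the second enumeration sees the change.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    public static void Main()
    {
        int[] numbers = { 2, 5, 28 };                   // Assumed data

        IEnumerable<int> lowNums = from n in numbers    // Defined, but not executed
                                   where n < 20
                                   select n;

        int countBefore = lowNums.Count();              // Scalar: the query runs now
        numbers[2] = 7;                                 // Change the data source
        int countAfter = lowNums.Count();               // The query runs again on the new data

        Console.WriteLine("{0} {1}", countBefore, countAfter);   // Prints "2 3"
    }
}
```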
Figure 21-2 illustrates this for the enumerable query. Variable lowNums contains a reference to the enumerable that can enumerate the query results from the array.
Figure 21-2. The compiler creates an object that implements IEnumerable<int>
The Structure of Query Expressions
A query expression consists of a from clause followed by a query body, as illustrated in Figure 21-3. Some of the important things to know about query expressions are the following:
• The clauses must appear in the order shown.
– The two parts that are required are the from clause and the select...group clause.
– The other clauses are optional.
• In a LINQ query expression, the select clause is at the end of the expression. This is different from SQL, where the SELECT clause is at the beginning of a query. One of the reasons for using this position in C# is that it allows Visual Studio’s IntelliSense to give you more options while you’re entering code.
• There can be any number of from...let...where clauses, as illustrated in the figure.
Figure 21-3. The structure of a query statement consists of a from clause followed by a query body.
The from Clause
The from clause specifies the data collection that is to be used as the data source. It also introduces the iteration variable. The important points about the from clause are the following:
• The iteration variable sequentially represents each element in the data source.
• The syntax of the from clause is: from Type Item in Items, where
– Type is the type of the elements in the collection. This is optional, because the compiler can infer the type from the collection.
– Item is the name of the iteration variable.
– Items is the name of the collection to be queried. The collection must be enumerable, as described in Chapter 13.
The following code shows a query expression used to query an array of four ints. Iteration variable item will represent each of the four elements in the array, and will be either selected or rejected by the where and select clauses following it. This code leaves out the optional type (int) of the iteration variable.
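A minimal sketch of this example follows. The array name arr1, its values, and the condition in the where clause are assumptions chosen to match the output below.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        int[] arr1 = { 10, 11, 12, 13 };     // Assumed data: an array of four ints

        // The iteration variable's type is inferred here;
        // "from int item in arr1" would state it explicitly.
        var query = from item in arr1
                    where item < 13
                    select item;

        foreach (var item in query)
            Console.Write("{0}, ", item);
    }
}
```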
This code produces the following output:
10, 11, 12,
The syntax of the from clause is shown in Figure 21-4. The type specifier is optional, since it can be inferred by the compiler. There can be any number of optional join clauses.
Figure 21-4. The syntax of the from clause
Although there is a strong similarity between the LINQ from clause and the foreach statement, there are several major differences:
• The foreach statement executes its body at the point in the code where it is encountered. The from clause, on the other hand, does not execute anything. It creates an enumerable object that is stored in the query variable. The query itself might or might not be executed later in the code.
• The foreach statement imperatively specifies that the items in the collection are to be considered in order, from the first to the last. The from clause declaratively states that each item in the collection must be considered, but does not assume an order.
The join Clause
The join clause in LINQ is much like the JOIN clause in SQL. If you’re familiar with joins from SQL, then joins in LINQ will be nothing new for you conceptually, except for the fact that you can now perform them on collections of objects as well as database tables. If you’re new to joins, or need a refresher, then the next section should help clear things up for you.
The first important things to know about a join are the following:
• A join operation takes two collections and creates a new temporary collection of objects, where each object contains the fields from one object of each of the two initial collections.
• Use a join to combine data from two or more collections.
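The syntax for a join is shown here. It specifies that the second collection is to be joined with the collection in the previous clause. Identifier names each element of the second collection, and Field1 and Field2 are the values compared for equality.
join Identifier in Collection2 on Field1 equals Field2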
Figure 21-5 illustrates the syntax for the join clause.
Figure 21-5. Syntax for the join clause
The following annotated statement shows an example of the join clause:
What Is a Join?
A join in LINQ takes two collections and creates a new collection where each element has members from the elements of the two original collections.
For example, the following code declares two classes: Student and CourseStudent.
• Objects of type Student contain a student’s last name and student ID number.
• Objects of type CourseStudent represent a student that is enrolled in a course, and contain the course name and a student ID number.
Figure 21-6 shows the situation in a program where there are three students and three courses, and the students are enrolled in various courses. The program has an array called students, of Student objects, and an array called studentsInCourses, of CourseStudent objects, which contains one object for every student enrolled in each course.
Figure 21-6. Students enrolled in various courses
Suppose now that you want to get the last name of every student in a particular course. The students array has the last names and the studentsInCourses array has the course enrollment information. To get the information, you must combine the information in the arrays, based on the student ID field, which is common to objects of both types. You can do this with a join on the StID field.
Figure 21-7 shows how the join works. The left column shows the students array and the right column shows the studentsInCourses array. If we take the first student record and compare its ID with the student ID in each studentsInCourses object, we find that two of them match, as shown at the top of the center column. If we then do the same with the other two students, we find that the second student is taking one course, and the third student is taking two courses.
The five grayed objects in the middle column represent the join of the two arrays on field StID. Each object contains three fields: the LastName field from the Student class, the CourseName field from the CourseStudent class, and the StID field common to both classes.
Figure 21-7. Two arrays of objects and their join on field StID
The following code puts the whole example together. The query finds the last names of all the students taking the history course.
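The complete example might look like the following sketch. The class, array, and field names come from the text; the course names other than History, the name Klassen, and the ID values are assumptions that match the enrollment pattern described above (two courses, one course, two courses).

```csharp
using System;
using System.Linq;

class Program
{
    public class Student
    {
        public int StID;
        public string LastName;
    }

    public class CourseStudent
    {
        public string CourseName;
        public int StID;
    }

    static Student[] students = new Student[] {
        new Student { StID = 1, LastName = "Carson" },
        new Student { StID = 2, LastName = "Klassen" },     // Assumed name
        new Student { StID = 3, LastName = "Flemming" }
    };

    static CourseStudent[] studentsInCourses = new CourseStudent[] {
        new CourseStudent { CourseName = "Art",     StID = 1 },
        new CourseStudent { CourseName = "Art",     StID = 2 },
        new CourseStudent { CourseName = "History", StID = 1 },
        new CourseStudent { CourseName = "History", StID = 3 },
        new CourseStudent { CourseName = "Physics", StID = 3 }
    };

    public static void Main()
    {
        // Join the two arrays on StID, then keep only the History enrollments.
        var query = from s in students
                    join c in studentsInCourses on s.StID equals c.StID
                    where c.CourseName == "History"
                    select s.LastName;

        foreach (var name in query)
            Console.WriteLine("Student taking History: {0}", name);
    }
}
```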
This code produces the following output:
Student taking History: Carson
Student taking History: Flemming
The from . . . let . . . where Section in the Query Body
The optional from...let...where section is the first section of the query body. It can have any number of any of the three clauses that comprise it: the from clause, the let clause, and the where clause. Figure 21-8 summarizes the syntax of the three clauses.
Figure 21-8. The syntax of the from . . . let . . . where clause
The from Clause
You saw that a query expression starts with a required from clause, which is followed by the query body. The body itself can start with any number of additional from clauses, where each subsequent from clause specifies an additional source data collection and introduces a new iteration variable for use in further evaluations. The syntax and meanings of all the from clauses are the same.
The following code shows an example of this use.
• The first from clause is the required clause of the query expression.
• The second from clause is the first clause of the query body.
• The select clause creates objects of an anonymous type. I covered anonymous types earlier in the chapter, but will touch on them again shortly, describing how they are used in query expressions.
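A minimal sketch of a query with two from clauses follows. The arrays groupA and groupB, their values, and the where condition are assumptions chosen to match the output below.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var groupA = new[] { 3, 4, 5, 6 };           // Assumed data
        var groupB = new[] { 6, 7, 8, 9 };

        var someInts = from a in groupA              // Required first from clause
                       from b in groupB              // First clause of the query body
                       where a > 4 && b <= 8
                       select new { a, b, sum = a + b };   // Object of anonymous type

        foreach (var x in someInts)
            Console.WriteLine(x);
    }
}
```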
This code produces the following output:
{ a = 5, b = 6, sum = 11 }
{ a = 5, b = 7, sum = 12 }
{ a = 5, b = 8, sum = 13 }
{ a = 6, b = 6, sum = 12 }
{ a = 6, b = 7, sum = 13 }
{ a = 6, b = 8, sum = 14 }
The let Clause
The let clause takes the evaluation of an expression and assigns it to an identifier to be used in other evaluations. The syntax of the let clause is the following:
let Identifier = Expression
For example, the query expression in the following code pairs each element of array groupA with each element of array groupB. The let clause gives the name sum to the sum of each pair, and the where clause keeps only the pairs whose sum is equal to 12.
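A minimal sketch of such a query follows. The array values are assumptions chosen to match the output below.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var groupA = new[] { 3, 4, 5, 6 };       // Assumed data
        var groupB = new[] { 6, 7, 8, 9 };

        var someInts = from a in groupA
                       from b in groupB
                       let sum = a + b           // Name the result of the expression
                       where sum == 12
                       select new { a, b, sum };

        foreach (var x in someInts)
            Console.WriteLine(x);
    }
}
```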
This code produces the following output:
{ a = 3, b = 9, sum = 12 }
{ a = 4, b = 8, sum = 12 }
{ a = 5, b = 7, sum = 12 }
{ a = 6, b = 6, sum = 12 }
The where Clause
The where clause eliminates items from further consideration if they don’t meet the specified condition. The syntax of the where clause is the following:
where BooleanExpression
Important things to know about the where clause are the following:
• A query expression can have any number of where clauses, as long as they are in the from...let...where section.
• An item must satisfy all the where clauses to avoid elimination from further consideration.
The following code shows an example of a query expression that contains two where clauses. The first where clause keeps only the pairs of integers from the two arrays whose sum is greater than or equal to 11, and the second keeps only the pairs in which the element from groupA is the value 4. Each pair selected must satisfy the conditions of both where clauses.
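A minimal sketch of such a query follows. The array values are assumptions chosen to match the output below.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var groupA = new[] { 3, 4, 5, 6 };       // Assumed data
        var groupB = new[] { 6, 7, 8, 9 };

        var someInts = from a in groupA
                       from b in groupB
                       let sum = a + b
                       where sum >= 11           // First condition
                       where a == 4              // Second condition
                       select new { a, b, sum };

        foreach (var x in someInts)
            Console.WriteLine(x);
    }
}
```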
This code produces the following output:
{ a = 4, b = 7, sum = 11 }
{ a = 4, b = 8, sum = 12 }
{ a = 4, b = 9, sum = 13 }
The orderby Clause
The orderby clause takes an expression and returns the result items in order according to the expression.
The syntax of the orderby clause is shown in Figure 21-9. The optional keywords ascending and descending set the direction of the order. Expression is generally a field of the items.
• The default ordering of an orderby clause is ascending. You can, however, explicitly set the ordering of the elements to either ascending or descending, using the ascending and descending keywords.
• An orderby clause can contain any number of ordering expressions, which must be separated by commas.
Figure 21-9. The syntax of the orderby clause
The following code shows an example of student records ordered by the ages of the students. Notice that the array of student information is stored in an array of anonymous types.
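A minimal sketch of this example follows. The property names and the unsorted starting order of the array are assumptions; the student data matches the output below.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var students = new[]      // Array of objects of an anonymous type
        {
            new { LName = "Jones",   FName = "Mary",  Age = 19, Major = "History" },
            new { LName = "Fleming", FName = "Carol", Age = 21, Major = "History" },
            new { LName = "Smith",   FName = "Bob",   Age = 20, Major = "CompSci" }
        };

        var query = from student in students
                    orderby student.Age          // Ascending by default
                    select student;

        foreach (var s in query)
            Console.WriteLine("{0}, {1}: {2} - {3}", s.LName, s.FName, s.Age, s.Major);
    }
}
```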
This code produces the following output:
Jones, Mary: 19 - History
Smith, Bob: 20 - CompSci
Fleming, Carol: 21 - History
The select . . . group Clause
There are two types of clauses that make up the select...group section—the select clause and the group...by clause. While the clauses that precede the select...group section specify the data sources and which objects to choose, the select...group section does the following:
• The select clause specifies which parts of the chosen objects should be selected. It can specify any of the following:
– The entire data item
– A field from the data item
– A new object comprising several fields from the data item (or any other value, for that matter).
• The group...by clause is optional, and specifies how the chosen items should be grouped. We will cover the group...by clause later in the chapter.
The syntax for the select...group clause is shown in Figure 21-10.
Figure 21-10. The syntax of the select . . . group clause
The following code shows an example of using the select clause to select the entire data item. First, an array of objects of an anonymous type is created. The query expression then uses the select clause to select each item in the array.
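A minimal sketch of this example follows. The property names are assumptions; the student data matches the output below.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var students = new[]      // Array of objects of an anonymous type
        {
            new { LName = "Jones",   FName = "Mary",  Age = 19, Major = "History" },
            new { LName = "Smith",   FName = "Bob",   Age = 20, Major = "CompSci" },
            new { LName = "Fleming", FName = "Carol", Age = 21, Major = "History" }
        };

        var query = from s in students
                    select s;                    // Select the entire data item

        foreach (var q in query)
            Console.WriteLine("{0}, {1}: Age {2}, {3}", q.LName, q.FName, q.Age, q.Major);
    }
}
```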
This code produces the following output:
Jones, Mary: Age 19, History
Smith, Bob: Age 20, CompSci
Fleming, Carol: Age 21, History
You can also use the select clause to choose only particular fields of the object. For example, the select clause in the following code only selects the last name of the student.
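A sketch of the field-selecting version follows, written as a complete program so that it can be run on its own. The property name LName is an assumption.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var students = new[]
        {
            new { LName = "Jones",   FName = "Mary",  Age = 19, Major = "History" },
            new { LName = "Smith",   FName = "Bob",   Age = 20, Major = "CompSci" },
            new { LName = "Fleming", FName = "Carol", Age = 21, Major = "History" }
        };

        var query = from s in students
                    select s.LName;              // Select one field only

        foreach (var q in query)
            Console.WriteLine(q);                // Each result is a string
    }
}
```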
When you substitute these two statements for the corresponding two statements in the preceding full example, the program produces the following output:
Jones
Smith
Fleming
Anonymous Types in Queries
The result of a query can consist of items from the source collections, fields from the items in the source collections, or anonymous types.
You can create an anonymous type in a select clause by placing curly braces around a comma-separated list of fields you want to include in the type. For example, to make the code in the previous section select just the names and majors of the students, you could use the following syntax:
For example, the following code creates an anonymous type in the select clause, and uses it later in the WriteLine statement.
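A minimal sketch of this example follows. The property names are assumptions; the select clause uses the new keyword followed by the list of fields in curly braces.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var students = new[]
        {
            new { LName = "Jones",   FName = "Mary",  Age = 19, Major = "History" },
            new { LName = "Smith",   FName = "Bob",   Age = 20, Major = "CompSci" },
            new { LName = "Fleming", FName = "Carol", Age = 21, Major = "History" }
        };

        var query = from s in students
                    select new { s.LName, s.FName, s.Major };   // New anonymous type

        foreach (var q in query)
            Console.WriteLine("{0} {1} -- {2}", q.FName, q.LName, q.Major);
    }
}
```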
This code produces the following output:
Mary Jones -- History
Bob Smith -- CompSci
Carol Fleming -- History
The group Clause
The group clause groups the selected objects according to some criterion. For example, with the array of students in the previous examples, the program could group the students according to their majors.
The important things to know about the group clause are the following:
• When items are included in the result of the query, they are placed in groups according to the value of a particular field. The value on which items are grouped is called the key.
• Unlike the select clause, the group clause does not return an enumerable that can enumerate the items from the original source. Instead, it returns an enumerable that enumerates the groups of items that have been formed.
• The groups themselves are enumerable, and can enumerate the actual items.
An example of the syntax of the group clause is the following:
For example, the following code groups the students according to their majors:
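A minimal sketch of this example follows. The property names are assumptions; the clause has the form group student by student.Major, and the outer loop walks the groups while the inner loop walks the items in each group.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var students = new[]
        {
            new { LName = "Jones",   FName = "Mary",  Age = 19, Major = "History" },
            new { LName = "Smith",   FName = "Bob",   Age = 20, Major = "CompSci" },
            new { LName = "Fleming", FName = "Carol", Age = 21, Major = "History" }
        };

        var query = from student in students
                    group student by student.Major;     // Major is the key

        foreach (var g in query)                         // Enumerate the groups
        {
            Console.WriteLine(g.Key);
            foreach (var s in g)                         // Enumerate the items in a group
                Console.WriteLine("{0}, {1}", s.LName, s.FName);
        }
    }
}
```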
This code produces the following output:
History
Jones, Mary
Fleming, Carol
CompSci
Smith, Bob
Figure 21-11 illustrates the object that is returned from the query expression and stored in the query variable.
• The object returned from the query expression is an enumerable that enumerates the groups resulting from the query.
• Each group is distinguished by a field called Key.
• Each group is itself enumerable and can enumerate its items.
Figure 21-11. The group clause returns a collection of collections of objects rather than a collection of objects.
Query Continuation
A query continuation clause takes the result of one part of a query and assigns it a name so that it can be used in another part of the query. The syntax for query continuation is shown in Figure 21-12.
Figure 21-12. The syntax of the query continuation clause
For example, the following query joins groupA and groupB and names the join groupAandB. It then performs a simple select from groupAandB.
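A minimal sketch of such a query follows, with assumed array values. Because into follows a join clause here, groupAandB names, for each element of groupA, the group of matching elements from groupB; the final from clause then enumerates those matches.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var groupA = new[] { 3, 4, 5, 6 };       // Assumed data
        var groupB = new[] { 4, 5, 6, 7 };

        var someInts = from a in groupA
                       join b in groupB on a equals b
                       into groupAandB           // Name the result for later use
                       from c in groupAandB
                       select c;

        foreach (var v in someInts)
            Console.Write("{0} ", v);
    }
}
```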
This code produces the following output:
4 5 6
The Standard Query Operators
The standard query operators comprise an application programming interface (API): a set of methods that lets you query any .NET array or collection. Important characteristics of the standard query operators are the following:
• The collection objects queried are called sequences, and must implement the IEnumerable<T> interface, where T is some type.
• The standard query operators use method syntax.
• Some operators return IEnumerable objects (or other sequences), while others return scalars. Operators that return scalars execute their queries immediately and return a value instead of an enumerable object to be iterated over later.
For example, the following code shows the use of operators Sum and Count, which return ints. Notice the following about the code:
• The operators are used as methods directly on the sequence object, which in this case is the array numbers.
• The return type is not an IEnumerable object, but an int.
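A minimal sketch of this example follows. The array values and the variable names total and howMany are assumptions chosen to match the output below.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        int[] numbers = { 2, 4, 6 };         // Assumed data

        int total = numbers.Sum();           // Scalar: executes immediately
        int howMany = numbers.Count();       // Scalar: executes immediately

        Console.WriteLine("Total: {0}, Count: {1}", total, howMany);
    }
}
```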
This code produces the following output:
Total: 12, Count: 3
There are 47 standard query operators that fall into 14 different categories. These categories are shown in Table 21-1.
Table 21-1. Categories of the Standard Query Operators
Query Expressions and the Standard Query Operators
As mentioned at the beginning of the chapter, every query expression can also be written using method syntax with the standard query operators. The set of standard query operators is a set of methods for performing queries. The compiler translates every query expression into standard query operator form.
Clearly, since all query expressions are translated into the standard query operators, the operators can perform everything done by query expressions. But the operators also give additional capabilities that aren’t available in query expression form. For example, operators Sum and Count, which were used in the previous example, can only be expressed using the method syntax.
The two forms, query expressions and method syntax, however, can be combined. For example, the following code shows a query expression that also uses operator Count. Notice in the code that the query expression part of the statement is inside parentheses, which is followed by a dot and the name of the method.
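A minimal sketch of the combined form follows. The array values and the where condition are assumptions chosen so that three elements qualify.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        var numbers = new int[] { 2, 6, 4, 8, 10 };     // Assumed data

        int howMany = (from n in numbers                // Query expression in parentheses
                       where n < 7
                       select n).Count();               // Followed by a dot and the operator

        Console.WriteLine("Count: {0}", howMany);
    }
}
```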
This code produces the following output:
Count: 3
Signatures of the Standard Query Operators
The standard query operators are methods declared in class System.Linq.Enumerable. These methods, however, aren’t just any methods: they are extension methods that extend the generic interface IEnumerable<T>.
Extension methods were covered in Chapters 7 and 19, but the most important thing to remember about them is that they are public, static methods that, although defined in one class, are designed to add functionality to another class, the one listed as the first formal parameter. This formal parameter must be preceded by the keyword this.
For example, following are the signatures of three of the operators: Count, First, and Where. At first glance, the signatures of the operators can be somewhat intimidating. Notice the following about the signatures:
• Since the operators are generic methods, they have a generic parameter (T) associated with their names.
• Since the operators are extension methods that extend IEnumerable<T>, they must satisfy the following syntactic requirements:
– They must be declared public and static.
– They must have the this extension indicator before the first parameter.
– They must have IEnumerable<T> as the type of the first parameter.
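The same requirements can be seen in a hand-written extension method. The following sketch declares a hypothetical class MyExtensions with a hypothetical method CountItems; neither is part of LINQ, but the declaration has the same shape as the single-parameter Count operator.

```csharp
using System;
using System.Collections.Generic;

public static class MyExtensions                 // Hypothetical class, not part of LINQ
{
    // public and static, generic in T, with "this" before a first
    // parameter of type IEnumerable<T>
    public static int CountItems<T>(this IEnumerable<T> source)
    {
        int count = 0;
        foreach (T item in source)
            count++;
        return count;
    }
}

class Program
{
    public static void Main()
    {
        int[] numbers = { 3, 4, 5 };

        Console.WriteLine(MyExtensions.CountItems(numbers));   // Static method syntax
        Console.WriteLine(numbers.CountItems());               // Extension method syntax
    }
}
```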
For example, the following code shows the use of operators Count and First. Both operators take only a single parameter: the reference to the IEnumerable<T> object.
• The Count operator returns a single value that is the count of all the elements in the sequence.
• The First operator returns the first element of the sequence.
The first two times the operators are used in this code, they are called directly, just like normal methods, passing the name of the array as the first parameter. In the following two lines, however, they are called using the extension method syntax, as if they were method members of the array, which is enumerable. Notice that in this case no parameter is supplied. Instead, the array name has been moved from the parameter list to before the method name. There it is used as if it contained a declaration of the method.
The direct syntax calls and the extension syntax calls are completely equivalent in effect; only their syntax is different.
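A minimal sketch of this example follows. The array name intArray and its values are assumptions chosen to match the output below.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        int[] intArray = { 3, 4, 5, 6, 7, 9 };          // Assumed data

        var count1 = Enumerable.Count(intArray);        // Direct (static method) syntax
        var firstNum1 = Enumerable.First(intArray);

        var count2 = intArray.Count();                  // Extension method syntax
        var firstNum2 = intArray.First();

        Console.WriteLine("Count: {0}, FirstNumber: {1}", count1, firstNum1);
        Console.WriteLine("Count: {0}, FirstNumber: {1}", count2, firstNum2);
    }
}
```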
This code produces the following output:
Count: 6, FirstNumber: 3
Count: 6, FirstNumber: 3
Delegates As Parameters
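As you just saw in the previous section, the first parameter of every operator is a reference to an IEnumerable<T> object. Many operators also take generic delegates as additional parameters. The most important thing to know about generic delegates as parameters is the following: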
• Generic delegates are used to supply user-defined code to the operator.
To explain this, I’ll start with an example showing several ways you might use the Count operator. The Count operator is overloaded and has two forms. The first form, which was used in the previous example, has a single parameter, as shown here:
public static int Count<T>(this IEnumerable<T> source);
Like all extension methods, you can use it in the standard static method form or in the form of an instance method on an instance of the class it extends, as shown in the following two lines of code:
In these two instances, the query counts the number of ints in the given integer array. Suppose, however, that you only want to count the odd elements of the array. To do that, you must supply the Count method with code that determines whether or not an integer is odd.
To do this, you would use the second form of the Count method, which is shown following. It has a generic delegate as its second parameter. At the point it is invoked, you must supply a delegate object that takes a single input parameter of type T and returns a Boolean value. The return value of the delegate code must specify whether or not the element should be included in the count.
For example, the following code uses the second form of the Count operator to instruct it to include only those values that are odd. It does this by supplying a lambda expression that returns true if the input value is odd and false otherwise. (Lambda expressions were covered in Chapter 15.) At each iteration through the collection, Count calls this method (represented by the lambda expression) with the current value as input. If the input is odd, the method returns true and Count includes the element in the total.
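A minimal sketch of this example follows. The array values are assumptions chosen so that four of them are odd; the lambda expression is supplied as the delegate parameter of type Func<int, bool>.

```csharp
using System;
using System.Linq;

class Program
{
    public static void Main()
    {
        int[] intArray = { 3, 4, 5, 6, 7, 9 };              // Assumed data

        // Count calls the lambda once per element and counts the true results.
        var countOdd = intArray.Count(n => n % 2 == 1);

        Console.WriteLine("Count of odd numbers: {0}", countOdd);
    }
}
```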
This code produces the following output:
Count of odd numbers: 4
The LINQ Predefined Delegate Types
Like the Count operator from the previous example, many of the LINQ operators require you to supply code that directs how the operator performs its operation. You do this by using delegate objects as parameters.
Remember from Chapter 15 that you can think of a delegate object as an object that contains a method or list of methods with a particular signature and return type. When the delegate is invoked, the methods it contains are invoked in sequence.
LINQ defines a family of five generic delegate types for use with the standard query operators. These are the Func delegates.
• The delegate objects you create for use as actual parameters must be of these five types or of these forms.
• TR represents the return type, and is always last in the list of type parameters.
The five generic delegate types are listed here. The first form takes no method parameters and returns an object of the return type. The second takes a single method parameter and returns a value, and so forth.
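As an illustration of how these delegate types are used, the following sketch creates delegate objects of several of the Func forms. The variable names and lambda bodies are assumptions made for the example; in each case the last type parameter is the return type.

```csharp
using System;

class Program
{
    public static void Main()
    {
        Func<int> getSeven = () => 7;                   // No parameters, returns int
        Func<int, int> doubleIt = x => x * 2;           // One parameter, returns int
        Func<int, int, int> add = (x, y) => x + y;      // Two parameters, returns int
        Func<int, bool> isOdd = x => x % 2 == 1;        // One parameter, returns bool

        Console.WriteLine("{0} {1} {2} {3}",
            getSeven(), doubleIt(4), add(2, 3), isOdd(5));   // Prints "7 8 5 True"
    }
}
```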
With this in mind, if you look again at the declaration of Count, which follows, you can see that the second parameter must be a delegate object that takes a single value of some type T as the method parameter and returns a value of type bool.
Parameter delegates that produce a Boolean value are called predicates.
Example Using a Delegate Parameter
Now that you better understand Count’s signature and LINQ’s use of generic delegate parameters, you’ll be better able to understand a full example.
The following code first declares method IsOdd, which takes a single parameter of type int and returns a bool value stating whether the input parameter was odd. Method Main does the following:
• It declares an array of ints as the data source.
• It creates a delegate object called MyDel of type Func<int, bool>, and initializes it with method IsOdd.
• It calls Count using the delegate object.
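A minimal sketch of this example follows. The names IsOdd and MyDel come from the text; the array values are assumptions chosen so that four of them are odd.

```csharp
using System;
using System.Linq;

class Program
{
    static bool IsOdd(int x)                 // Method to be used by the delegate object
    {
        return x % 2 == 1;
    }

    public static void Main()
    {
        int[] intArray = { 3, 4, 5, 6, 7, 9 };               // Assumed data

        Func<int, bool> MyDel = new Func<int, bool>(IsOdd);  // Delegate object
        var countOdd = intArray.Count(MyDel);                // Use the delegate

        Console.WriteLine("Count of odd numbers: {0}", countOdd);
    }
}
```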
This code produces the following output:
Count of odd numbers: 4
Example Using a Lambda Expression Parameter
The previous example used a separate method and a delegate to attach the code to the operator. This required declaring the method, declaring the delegate object, and then passing the delegate object to the operator. This works fine, and is exactly the right approach to take if either of the following conditions is true:
• If the method must be called from elsewhere in the program, not just from the place where it is used to initialize the delegate object
• If the code in the method body is more than just a statement or two long
If neither of these conditions is true, however, you probably want to use a more compact and localized method of supplying the code to the operator, using a lambda expression as described in Chapter 15.
We can modify the previous example to use a lambda expression by first deleting the IsOdd method entirely, and placing the equivalent lambda expression directly at the declaration of the delegate object. The new code is shorter and cleaner, and looks like this:
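A minimal sketch of the lambda version, again assuming an array with four odd values. Here the lambda is passed straight to Count, so no separate delegate variable is declared; assigning it to a Func<int, bool> variable first would work equally well.

```csharp
using System;
using System.Linq;

public static class LambdaParameterSketch
{
    public static int CountOdd(int[] data)
    {
        // The lambda replaces both the IsOdd method and the delegate object.
        return data.Count(x => x % 2 != 0);
    }

    public static void Main()
    {
        int[] intArray = { 3, 4, 5, 6, 7, 9 };   // assumed sample data
        Console.WriteLine("Count of odd numbers: {0}", CountOdd(intArray));
    }
}
```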
Like the previous example, this code produces the following output:
Count of odd numbers: 4
We could also have used an anonymous method in place of the lambda expression, as shown in the following code. This is more verbose, though, and since lambda expressions are semantically equivalent and more concise, there’s little reason to use anonymous methods anymore.
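A sketch of the anonymous method version, using the same assumed sample data as before:

```csharp
using System;
using System.Linq;

public static class AnonymousMethodSketch
{
    public static int CountOdd(int[] data)
    {
        // Anonymous method: the delegate keyword, a parameter list, and a body.
        Func<int, bool> myDel = delegate(int x)
        {
            return x % 2 != 0;
        };
        return data.Count(myDel);
    }

    public static void Main()
    {
        int[] intArray = { 3, 4, 5, 6, 7, 9 };   // assumed sample data
        Console.WriteLine("Count of odd numbers: {0}", CountOdd(intArray));
    }
}
```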
LINQ to XML
Over the last several years, XML (Extensible Markup Language) has become an important method of storing and exchanging data. C# 3.0 adds features to the language that make working with XML much easier than previous methods such as XPath and XSLT. If you’re familiar with these methods, you might be pleased to hear that LINQ to XML simplifies the creation, traversal, and manipulation of XML in a number of ways, including the following:
• You can create an XML tree in a top-down fashion, with a single statement.
• You can create and manipulate XML in-memory without having an XML document to contain the tree.
• You can create and manipulate string nodes without having a Text sub-node.
Although I won’t give a complete treatment of XML, I will start by giving a very brief introduction to it before describing some of the XML-manipulation features introduced with C# 3.0.
Markup Languages
A markup language is a set of tags placed in a document to give information about the information in the document. That is, the markup tags are not the data of the document—they contain data about the data. Data about data is called metadata.
Each markup language defines a set of tags designed to convey particular types of metadata about the contents of a document. HTML, for example, is the most widely known markup language. The metadata in its tags contains information about how a web page should be rendered in a browser, and how to navigate among the pages using the hypertext links.
While most markup languages contain a predefined set of tags, XML contains only a few defined tags, and the rest are defined by the programmer to represent whatever kinds of metadata are required by a particular document type. As long as the writer and reader of the data agree on what the tags mean, the tags can contain whatever useful information the designers want.
XML Basics
Data in an XML document is contained in an XML tree, which consists mainly of a set of nested elements.
The element is the fundamental constituent of an XML tree. Every element has a name and can contain data. Some can also contain other, nested elements. Elements are demarcated by opening and closing tags. Any data contained by an element must be between its opening and closing tags.
• An opening tag starts with an open angle bracket, followed by the element name, followed optionally by any attributes, followed by a closing angle bracket.
• A closing tag starts with an open angle bracket, followed by a slash character, followed by the element name, followed by a closing angle bracket.
• An element with no content can be represented by a single tag that starts with an open angle bracket, followed by the name of the element, followed by a slash, and is terminated with a closing angle bracket.
The following XML fragment shows an element named EmployeeName followed by an empty element named PhoneNumber.
Other important things to know about XML are the following:
• XML documents must have a single root element that contains all the other elements.
• XML tags must be properly nested.
• Unlike HTML tags, XML tags are case sensitive.
• XML attributes are name/value pairs that contain additional metadata about an element. The value part of an attribute must always be enclosed in quotation marks, which can be either double quotation marks or single quotation marks.
• White space within an XML document is maintained. This is unlike HTML, where white space is consolidated to a single space in the output.
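Several of these rules can be checked with a few lines of code. The following sketch uses XDocument.Parse, which throws an XmlException when the text is not well formed; the helper name IsWellFormed and the sample strings are my own.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlRulesSketch
{
    // Returns true if the text parses as a well-formed XML document.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // A single root with a nested element and an empty element.
        Console.WriteLine(IsWellFormed(
            "<Employee><EmployeeName>Sally Jones</EmployeeName><PhoneNumber /></Employee>")); // True
        // Two top-level elements: no single root.
        Console.WriteLine(IsWellFormed("<Name>Bob</Name><Name>Sally</Name>"));   // False
        // Improperly nested tags.
        Console.WriteLine(IsWellFormed("<a><b></a></b>"));                       // False
        // Tags are case sensitive, so Name and name do not match.
        Console.WriteLine(IsWellFormed("<Name>Bob</name>"));                     // False
        // Attribute values may use single or double quotation marks.
        Console.WriteLine(IsWellFormed("<Item color='red' size=\"large\" />"));  // True
    }
}
```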
The following XML document is an example of XML that contains information about two employees. This XML tree is extremely simple in order to show the elements clearly. The important things to notice about the XML tree are the following:
• The tree contains a root node named Employees that contains two child nodes named Employee.
• Each Employee node contains nodes containing the name and phone numbers of an employee.
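A document with the structure just described might look like the text held in the string below. The employee names and phone numbers are assumed sample values, and the code around the string is only there to parse it and confirm the shape of the tree.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SampleDocumentSketch
{
    // Assumed sample data: one root, two Employee children.
    public const string EmployeesXml =
@"<Employees>
  <Employee>
    <Name>Bob Smith</Name>
    <PhoneNumber>408-555-1000</PhoneNumber>
  </Employee>
  <Employee>
    <Name>Sally Jones</Name>
    <PhoneNumber>415-555-2000</PhoneNumber>
    <PhoneNumber>415-555-2001</PhoneNumber>
  </Employee>
</Employees>";

    public static void Main()
    {
        XDocument doc = XDocument.Parse(EmployeesXml);
        Console.WriteLine(doc.Root.Name);                // Employees
        Console.WriteLine(doc.Root.Elements().Count());  // 2
    }
}
```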
Figure 21-13 illustrates the hierarchical structure of the sample XML tree.
Figure 21-13. Hierarchical structure of the sample XML tree
The XML Classes
LINQ to XML can be used to work with XML in two ways. The first way is as a simplified XML manipulation API. The second way is to use the LINQ query facilities you’ve seen throughout the earlier part of this chapter. I’ll start by introducing the LINQ to XML API.
The LINQ to XML API consists of a number of classes that represent the components of an XML tree. The three most important classes you will use are XElement, XAttribute, and XDocument. There are other classes as well, but these are the main ones.
In Figure 21-13, you saw that an XML tree is a set of nested elements. Figure 21-14 shows the classes used to build an XML tree and how they can be nested.
For example, the figure shows the following:
• An XDocument node can have as its direct child nodes:
– At most, one of each of the following node types: an XDeclaration node, an XDocumentType node, and an XElement node
– Any number of XProcessingInstruction nodes
• If there is a top-level XElement node under the XDocument, it is the root of the rest of the elements in the XML tree.
• The root element can in turn contain any number of nested XElement, XComment, or XProcessingInstruction nodes, nested to any level.
Figure 21-14. The containment structure of XML nodes
Except for the XAttribute class, most of the classes used to create an XML tree are derived from a class called XNode, and are referred to generically in the literature as “XNodes.” Figure 21-14 shows the XNode classes in white clouds, while the XAttribute class is shown in a gray cloud.
Creating, Saving, Loading, and Displaying an XML Document
The best way to demonstrate the simplicity and usage of the XML API is to show simple code samples. For example, the following code shows how simple it is to perform several of the important tasks required when working with XML.
It starts by creating a simple XML tree consisting of a node called Employees, with two subnodes containing the names of two employees. Notice the following about the code:
• The tree is created with a single statement that in turn creates all the nested elements in place in the tree. This is called functional construction.
• Each element is created in place using an object-creation expression, using the constructor of the type of the node.
After creating the tree, the code saves it to a file called EmployeesFile.xml, using XDocument’s Save method. It then reads the XML tree back from the file using XDocument’s static Load method, and assigns the tree to a new XDocument object. Finally, it uses WriteLine to display the structure of the tree held by the new XDocument object.
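A minimal sketch of these steps follows. The employee names and the variable, method, and class names are assumptions for illustration; the file name is the one given in the text.

```csharp
using System;
using System.Xml.Linq;

public static class SaveLoadSketch
{
    // Functional construction: the whole tree in a single statement.
    public static XDocument BuildEmployees()
    {
        return new XDocument(
            new XElement("Employees",
                new XElement("Name", "Bob Smith"),      // assumed sample names
                new XElement("Name", "Sally Jones")));
    }

    // Saves the tree to a file, then loads it into a new XDocument.
    public static XDocument RoundTrip(XDocument doc, string path)
    {
        doc.Save(path);                 // instance method
        return XDocument.Load(path);    // static method
    }

    public static void Main()
    {
        XDocument employees1 = BuildEmployees();
        XDocument employees2 = RoundTrip(employees1, "EmployeesFile.xml");
        Console.WriteLine(employees2);  // displays the structure of the tree
    }
}
```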
This code produces the following output:
Creating an XML Tree
In the previous example, you saw that you can create an XML document in-memory by using constructors for XDocument and XElement. The two constructors differ slightly:
• The first parameter of the XElement constructor is the name of the element. The XDocument constructor takes no name.
• The remaining parameters contain the nodes of the XML tree. In both constructors this is a params parameter, so you can pass any number of arguments.
For example, the following code produces an XML tree and displays it using the Console.WriteLine method:
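A sketch of such code is shown here, with assumed sample names and phone numbers. Each nested constructor call supplies the element name first, followed by that element's content.

```csharp
using System;
using System.Xml.Linq;

public static class CreateTreeSketch
{
    public static XDocument BuildEmployeeDoc()
    {
        return new XDocument(                      // no name, only content
            new XElement("Employees",              // name, then child nodes
                new XElement("Employee",
                    new XElement("Name", "Bob Smith"),
                    new XElement("PhoneNumber", "408-555-1000")),
                new XElement("Employee",
                    new XElement("Name", "Sally Jones"),
                    new XElement("PhoneNumber", "415-555-2000"),
                    new XElement("PhoneNumber", "415-555-2001"))));
    }

    public static void Main()
    {
        XDocument employeeDoc = BuildEmployeeDoc();
        Console.WriteLine(employeeDoc);   // displays the indented XML tree
    }
}
```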
This code produces the following output:
Using Values from the XML Tree
The power of XML becomes evident when you traverse an XML tree and retrieve or modify values. The main methods used for retrieving data are shown in Table 21-2.
Table 21-2. Methods for Querying XML
Some of the important things to know about the methods in Table 21-2 are the following:
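• Nodes: The Nodes method returns an object of type IEnumerable<XNode>, because the nodes returned might be of different types, such as XElement, XComment, and so on. You can use the generic method OfType<T> to specify which type of nodes to return. For example, the following line of code retrieves only the XComment nodes:
IEnumerable<XComment> comments = xd.Nodes().OfType<XComment>();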
• Elements: Since retrieving XElements is such a common requirement, there is a shortcut for the expression Nodes().OfType<XElement>(): the Elements method.
– Using the Elements method with no parameters returns all the child XElements.
– Using the Elements method with a single name parameter returns only the child XElements with that name. For example, the following line of code returns all the child XElement nodes with the name PhoneNumber.
IEnumerable<XElement> empPhones = emp.Elements("PhoneNumber");
• Element: This method retrieves just the first child XElement of the current node that has the specified name. Unlike the Elements method, it always requires a name parameter. If no child element has that name, it returns null.
• Descendants and Ancestors: These methods work like the Elements and Parent methods, but instead of returning the immediate child elements or parent element, they include the elements below or above the current node, regardless of the difference in nesting level.
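The differences among these methods are easier to see side by side. The following sketch uses a small tree of my own devising; the element names and values are assumptions.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class QueryMethodsSketch
{
    public static XElement BuildRoot()
    {
        return new XElement("Employees",
            new XComment("Sample data"),
            new XElement("Employee",
                new XElement("Name", "Bob Smith"),
                new XElement("PhoneNumber", "408-555-1000")));
    }

    public static void Main()
    {
        XElement root = BuildRoot();

        // Nodes returns child nodes of every kind; OfType filters by node type.
        Console.WriteLine(root.Nodes().Count());                     // 2
        Console.WriteLine(root.Nodes().OfType<XComment>().Count());  // 1

        // Elements sees only immediate children, so PhoneNumber is not found.
        Console.WriteLine(root.Elements("PhoneNumber").Count());     // 0

        // Descendants searches every nesting level below the node.
        XElement phone = root.Descendants("PhoneNumber").First();
        Console.WriteLine(phone.Value);                              // 408-555-1000

        // Ancestors walks upward, nearest ancestor first.
        foreach (XElement a in phone.Ancestors())
            Console.WriteLine(a.Name);                               // Employee, then Employees
    }
}
```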
The following code illustrates the Element and Elements methods:
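A sketch of such code, with the tree built from assumed sample data and with variable names of my own choosing:

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class ElementAndElementsSketch
{
    public static XDocument BuildEmployeeDoc()
    {
        return new XDocument(
            new XElement("Employees",
                new XElement("Employee",
                    new XElement("Name", "Bob Smith"),
                    new XElement("PhoneNumber", "408-555-1000")),
                new XElement("Employee",
                    new XElement("Name", "Sally Jones"),
                    new XElement("PhoneNumber", "415-555-2000"),
                    new XElement("PhoneNumber", "415-555-2001"))));
    }

    // Collects each employee's name followed by that employee's phone numbers.
    public static List<string> Lines(XDocument employeeDoc)
    {
        List<string> lines = new List<string>();
        XElement root = employeeDoc.Element("Employees");        // first child by name
        foreach (XElement emp in root.Elements())                // all child elements
        {
            lines.Add(emp.Element("Name").Value);
            foreach (XElement phone in emp.Elements("PhoneNumber"))  // children by name
                lines.Add(phone.Value);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Lines(BuildEmployeeDoc()))
            Console.WriteLine(line);
    }
}
```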
This code produces the following output:
Bob Smith
408-555-1000
Sally Jones
415-555-2000
415-555-2001
Adding Nodes and Manipulating XML
You can add a child element to an existing element using the Add method. The Add method allows you to add as many elements as you like in a single method call, regardless of the node types you are adding.
For example, the following code creates a simple XML tree and displays it. It then uses the Add method to add a single node to the root element. Following that, it uses the Add method a second time to add three elements—two XElements and an XComment. Notice the results in the output:
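A minimal sketch of that sequence follows. The element names and the comment text are assumptions chosen for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class AddNodesSketch
{
    public static XDocument BuildAndModify()
    {
        XDocument xd = new XDocument(
            new XElement("root",
                new XElement("first")));

        Console.WriteLine("Original tree");
        Console.WriteLine(xd);
        Console.WriteLine();

        XElement rt = xd.Element("root");

        rt.Add(new XElement("second"));              // add a single node

        rt.Add(new XElement("third"),                // add three nodes in one call,
               new XComment("Important Comment"),    // mixing node types
               new XElement("fourth"));
        return xd;
    }

    public static void Main()
    {
        XDocument xd = BuildAndModify();
        Console.WriteLine("Modified tree");
        Console.WriteLine(xd);
    }
}
```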
This code produces the following output:
The Add method places the new child nodes after the existing child nodes, but you can place the nodes before and between the child nodes as well, using the AddFirst, AddBeforeSelf, and AddAfterSelf methods.
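The following sketch shows those three methods placing nodes at specific positions. Note that AddFirst is called on the parent, while AddBeforeSelf and AddAfterSelf are called on a sibling. The element names are assumptions.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AddPositionSketch
{
    public static XElement Build()
    {
        XElement root = new XElement("root",
            new XElement("second"));

        root.AddFirst(new XElement("first"));               // before all children

        XElement second = root.Element("second");
        second.AddAfterSelf(new XElement("fourth"));        // right after "second"

        XElement fourth = root.Element("fourth");
        fourth.AddBeforeSelf(new XElement("third"));        // right before "fourth"
        return root;
    }

    public static void Main()
    {
        // Children now appear in the order first, second, third, fourth.
        Console.WriteLine(Build());
    }
}
```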
Table 21-3 lists some of the most important methods for manipulating XML. Notice that some of the methods are applied to the parent node and others to the node itself.
Table 21-3. Methods for Manipulating XML
Working with XML Attributes
Attributes give additional information about an XElement node. They are placed in the opening tag of the XML element.
When you functionally construct an XML tree, you can add attributes by just including XAttribute constructors within the scope of the XElement constructor. There are two forms of the XAttribute constructor; one takes a name and a value, and the other takes a reference to an already existing XAttribute.
The following code adds two attributes to root. Notice that both parameters to the XAttribute constructor are strings; the first specifies the name of the attribute, and the second gives the value.
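A sketch of such code follows. The attribute names and values, and the two child elements, are assumed for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class AddAttributesSketch
{
    public static XDocument Build()
    {
        return new XDocument(
            new XElement("root",
                new XAttribute("color", "red"),    // name, value
                new XAttribute("size", "large"),
                new XElement("first"),
                new XElement("second")));
    }

    public static void Main()
    {
        Console.WriteLine(Build());
    }
}
```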
This code produces the following output. Notice that the attributes are placed inside the opening tag of the element.
To retrieve an attribute from an XElement node, use the Attribute method, supplying the name of the attribute as the parameter. The following code creates an XML tree with a node with two attributes—color and size. It then retrieves the values of the attributes and displays them.
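A sketch of that code, with the attribute values taken from the output that follows and the remaining names assumed:

```csharp
using System;
using System.Xml.Linq;

public static class GetAttributesSketch
{
    public static XDocument Build()
    {
        return new XDocument(
            new XElement("root",
                new XAttribute("color", "red"),
                new XAttribute("size", "large"),
                new XElement("first")));
    }

    public static void Main()
    {
        XElement rt = Build().Element("root");

        XAttribute color = rt.Attribute("color");   // look up by name
        XAttribute size = rt.Attribute("size");

        Console.WriteLine("color is {0}", color.Value);
        Console.WriteLine("size is {0}", size.Value);
    }
}
```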
This code produces the following output:
color is red
size is large
To remove an attribute, you can select the attribute and use the Remove method, or use the SetAttributeValue method on its parent and set the attribute value to null. The following code demonstrates both methods:
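Both approaches are sketched below on an assumed tree with two attributes:

```csharp
using System;
using System.Xml.Linq;

public static class RemoveAttributesSketch
{
    public static XElement BuildAndRemove()
    {
        XElement rt = new XElement("root",
            new XAttribute("color", "red"),
            new XAttribute("size", "large"),
            new XElement("first"));

        rt.Attribute("color").Remove();       // select the attribute, then remove it
        rt.SetAttributeValue("size", null);   // setting the value to null also removes it
        return rt;
    }

    public static void Main()
    {
        Console.WriteLine(BuildAndRemove());  // root now has no attributes
    }
}
```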
This code produces the following output:
To add an attribute to an XML tree or change the value of an attribute, you can use the SetAttributeValue method, as shown in the following code:
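In the following sketch, the first call changes an existing attribute and the second adds a new one. The attribute names and values are assumptions.

```csharp
using System;
using System.Xml.Linq;

public static class SetAttributeSketch
{
    public static XElement BuildAndSet()
    {
        XElement rt = new XElement("root",
            new XAttribute("color", "red"),
            new XAttribute("size", "large"),
            new XElement("first"));

        rt.SetAttributeValue("size", "medium");    // attribute exists: value is changed
        rt.SetAttributeValue("width", "narrow");   // attribute is missing: it is added
        return rt;
    }

    public static void Main()
    {
        Console.WriteLine(BuildAndSet());
    }
}
```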
This code produces the following output:
Other Types of Nodes
Three other types of nodes used in the previous examples are XComment, XDeclaration, and XProcessingInstruction. They are described in the following sections.
XComment
Comments in XML consist of text between the <!-- and --> tokens. The text between the tokens is ignored by XML parsers. You can insert a comment in an XML document using the XComment class, as shown in the following line of code:
new XComment("This is a comment")
XDeclaration
XML documents start with a line that includes the version of XML used, the type of character encoding used, and whether or not the document depends on external references. This is called the XML declaration, and is inserted using the XDeclaration class. The following shows an example of an XDeclaration statement:
new XDeclaration("1.0", "utf-8", "yes")
XProcessingInstruction
An XML processing instruction is used to supply additional data about how an XML document should be used or interpreted. Most commonly, processing instructions are used to associate a style sheet with the XML document.
You can include a processing instruction using the XProcessingInstruction constructor, which takes two string parameters—a target and a data string. If the processing instruction takes multiple data parameters, those parameters must be included in the second parameter string of the XProcessingInstruction constructor, as shown in the following constructor code. Notice that in this example, the second parameter is a verbatim string, and literal double quotes inside the string are represented by sets of two contiguous double quote marks.
new XProcessingInstruction( "xml-stylesheet",
@"href=""stories"", type=""text/css""")
The following code uses all three constructs:
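A sketch combining the three constructors shown above. The element names and the output file name OtherNodes.xml are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class OtherNodesSketch
{
    public static XDocument Build()
    {
        return new XDocument(
            new XDeclaration("1.0", "utf-8", "yes"),
            new XComment("This is a comment"),
            new XProcessingInstruction("xml-stylesheet",
                @"href=""stories"", type=""text/css"""),
            new XElement("root",
                new XElement("first"),
                new XElement("second")));
    }

    public static void Main()
    {
        XDocument xd = Build();
        xd.Save("OtherNodes.xml");   // hypothetical file name; the saved file includes the declaration
        Console.WriteLine(xd);       // ToString omits the declaration
    }
}
```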
This code produces the following output in the output file. Using a WriteLine of xd, however, would not show the declaration statement, even though it is included in the document file.
Using LINQ Queries with LINQ to XML
You can combine the LINQ XML API with LINQ query expressions to produce simple yet powerful XML tree searches.
The following code creates a simple XML tree, displays it to the screen, and then saves it to a file called SimpleSample.xml. Although there’s nothing new in this code, we’ll use this XML tree in the following examples.
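A sketch of that code follows. The element names and most attribute values are taken from the outputs shown later in this section; the root name MyElements and the size of the second element are assumptions.

```csharp
using System;
using System.Xml.Linq;

public static class SimpleSampleSketch
{
    public static XDocument Build()
    {
        return new XDocument(
            new XElement("MyElements",                   // assumed root name
                new XElement("first",
                    new XAttribute("color", "red"),
                    new XAttribute("size", "small")),
                new XElement("second",
                    new XAttribute("color", "red"),
                    new XAttribute("size", "medium")),   // assumed value
                new XElement("third",
                    new XAttribute("color", "blue"),
                    new XAttribute("size", "large"))));
    }

    public static void Main()
    {
        XDocument xd = Build();
        Console.WriteLine(xd);          // display the tree
        xd.Save("SimpleSample.xml");    // save it for later use
    }
}
```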
This code produces the following output:
The following example code uses a simple LINQ query to select a subset of the nodes from the XML tree, and then displays them in several ways. This code does the following:
• It selects from the XML tree only those elements whose names have five characters. Since the names of the elements are first, second, and third, only node names first and third match the search criterion, and therefore those nodes are selected.
• It displays the names of the selected elements.
• It formats and displays the selected nodes, including the node name and the values of the attributes. Notice that the attributes are retrieved using the Attribute method, and the values of the attributes are retrieved with the Value property.
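The steps above can be sketched as follows. So that the example stands on its own, it rebuilds the tree in memory rather than loading SimpleSample.xml; the root name MyElements and the size of the second element are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlQuerySketch
{
    public static XElement BuildRoot()
    {
        return new XElement("MyElements",
            new XElement("first",
                new XAttribute("color", "red"), new XAttribute("size", "small")),
            new XElement("second",
                new XAttribute("color", "red"), new XAttribute("size", "medium")),
            new XElement("third",
                new XAttribute("color", "blue"), new XAttribute("size", "large")));
    }

    // Selects the child elements whose names have exactly five characters.
    public static IEnumerable<XElement> FiveLetterElements(XElement rt)
    {
        return from e in rt.Elements()
               where e.Name.ToString().Length == 5
               select e;
    }

    public static void Main()
    {
        IEnumerable<XElement> xyz = FiveLetterElements(BuildRoot());

        foreach (XElement x in xyz)
            Console.WriteLine(x.Name);            // names of the selected elements

        foreach (XElement x in xyz)
            Console.WriteLine("Name: {0}, color: {1}, size: {2}",
                x.Name,
                x.Attribute("color").Value,       // Attribute gets the node,
                x.Attribute("size").Value);       // Value gets its text
    }
}
```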
This code produces the following output:
first
third
Name: first, color: red, size: small
Name: third, color: blue, size: large
The following code uses a simple query to retrieve all the top-level elements of the XML tree, and creates an object of an anonymous type for each one. The first use of the WriteLine method shows the default formatting of the anonymous type. The second WriteLine statement explicitly formats the members of the anonymous type objects.
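A sketch of such a query follows. As before, the tree is rebuilt in memory with an assumed root name, and the field width of 6 in the format string is my choice to align the names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AnonymousTypeQuerySketch
{
    public static XElement BuildRoot()
    {
        return new XElement("MyElements",
            new XElement("first", new XAttribute("color", "red")),
            new XElement("second", new XAttribute("color", "red")),
            new XElement("third", new XAttribute("color", "blue")));
    }

    // Returns the default-formatted lines followed by the explicitly formatted lines.
    public static List<string> Describe(XElement rt)
    {
        // Each result is an anonymous type with a Name (XName) and a color (XAttribute).
        var xyz = from e in rt.Elements()
                  select new { e.Name, color = e.Attribute("color") };

        List<string> lines = new List<string>();
        foreach (var x in xyz)
            lines.Add(x.ToString());                          // default formatting
        foreach (var x in xyz)
            lines.Add(string.Format("{0,-6}, color: {1}",     // explicit formatting
                                    x.Name, x.color.Value));
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe(BuildRoot()))
            Console.WriteLine(line);
    }
}
```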
This code produces the following output. The first three lines show the default formatting of the anonymous type. The last three lines show the explicit formatting specified in the format string of the second WriteLine method.
{ Name = first, color = color="red" }
{ Name = second, color = color="red" }
{ Name = third, color = color="blue" }
first , color: red
second, color: red
third , color: blue
From these examples you can see that you can easily combine the XML API with the LINQ query facilities to produce powerful XML querying capabilities.