How to Search, Modify & Replace All Hyperlinks in a Word Document using .NET

by sher azam on Apr 15, 2015 | Category: C#


This technical tip explains how .NET developers can find and modify all hyperlinks in a Word document using Aspose.Words. It would be convenient to have a Hyperlink object with properties for this, but the current version of Aspose.Words has no built-in functionality for working with hyperlink fields.

Hyperlinks in Microsoft Word documents are fields, and a field consists of a field code and a field result. In the current version of Aspose.Words, no single object represents a field. Instead, a field is represented by a sequence of nodes: a FieldStart node, one or more Run nodes holding the field code, a FieldSeparator node, one or more Run nodes holding the field result, and a FieldEnd node.

Although Aspose.Words has no high-level abstraction for fields, and for hyperlink fields in particular, it exposes all of the necessary low-level document elements and their properties, so with a bit of coding you can implement quite sophisticated document manipulation features.

This example shows how to create a simple class that represents a hyperlink in the document. Its constructor accepts a FieldStart object whose type must be FieldType.FieldHyperlink. Once you have constructed a Hyperlink object, you can get or set its Target, Name, and IsLocal properties, which makes it easy to change the targets and names of hyperlinks throughout the document. The example finds all hyperlinks in a Word document and changes their URL to "http://www.aspose.com" and their display name to "Aspose - The .NET & Java Component Publisher". Local hyperlinks (links to bookmarks inside the document) are left unchanged.
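The two field code formats a hyperlink field can carry, HYPERLINK "url" and HYPERLINK \l "bookmark name", can be explored without Aspose.Words at all. The following is a minimal sketch rather than part of the original sample: FieldCodeDemo and TryParse are hypothetical names introduced here, and the pattern is assumed to mirror the one used by the Hyperlink class in the code sample.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; it does not use Aspose.Words.
public static class FieldCodeDemo
{
    // Keyword, whitespace, optional empty "" pair, optional \l switch, quoted target.
    private static readonly Regex FieldCodeRegex = new Regex(
        @"\S+\s+(?:""""\s+)?(\\l\s+)?""([^""]+)""");

    // Returns true if the text looks like a hyperlink field code.
    // isLocal is true when the \l switch is present; target is the quoted value.
    public static bool TryParse(string fieldCode, out bool isLocal, out string target)
    {
        Match match = FieldCodeRegex.Match(fieldCode.Trim());
        isLocal = match.Success && match.Groups[1].Length > 0;
        target = match.Success ? match.Groups[2].Value : null;
        return match.Success;
    }

    public static void Main()
    {
        string[] samples =
        {
            " HYPERLINK \"http://www.aspose.com\" ",
            " HYPERLINK \\l \"Bookmark1\" "
        };

        foreach (string code in samples)
        {
            bool isLocal;
            string target;
            if (TryParse(code, out isLocal, out target))
                Console.WriteLine("local={0} target={1}", isLocal, target);
            else
                Console.WriteLine("not a hyperlink field code: {0}", code);
        }
    }
}
```

Checking match.Success before reading the groups is a design choice of this sketch; it lets the caller tell an unrecognized field code apart from a hyperlink with an empty target.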

Code samples: find and modify all hyperlinks in a Word document

// [C# Code Sample]

using System;
using System.Text;
using System.Text.RegularExpressions;
using Aspose.Words;
using Aspose.Words.Fields;

namespace Examples
{
    /// <summary>
    /// Shows how to replace hyperlinks in a Word document.
    /// ExBase and MyDir are assumed to come from the Aspose.Words examples project;
    /// MyDir is the folder that holds the sample documents.
    /// </summary>
    public class ExReplaceHyperlinks : ExBase
    {
        /// <summary>
        /// Finds all hyperlinks in a Word document and changes their URL and display name.
        /// </summary>
        public void ReplaceHyperlinks()
        {
            // Specify your document name here.
            Document doc = new Document(MyDir + "ReplaceHyperlinks.doc");

            // Hyperlinks in a Word document are fields, so select all field start nodes to find the hyperlinks.
            NodeList fieldStarts = doc.SelectNodes("//FieldStart");
            foreach (FieldStart fieldStart in fieldStarts)
            {
                if (fieldStart.FieldType.Equals(FieldType.FieldHyperlink))
                {
                    // The field is a hyperlink field, so use the "facade" class to work with it.
                    Hyperlink hyperlink = new Hyperlink(fieldStart);

                    // Some hyperlinks can be local (links to bookmarks inside the document); ignore these.
                    if (hyperlink.IsLocal)
                        continue;

                    // The Hyperlink class lets you set the target URL and the display name
                    // of the link simply by setting its properties.
                    hyperlink.Target = NewUrl;
                    hyperlink.Name = NewName;
                }
            }

            doc.Save(MyDir + "ReplaceHyperlinks Out.doc");
        }

        private const string NewUrl = @"http://www.aspose.com";
        private const string NewName = "Aspose - The .NET & Java Component Publisher";
    }


    /// <summary>
    /// This "facade" class makes it easier to work with a hyperlink field in a Word document.
    ///
    /// A hyperlink is represented by a HYPERLINK field in a Word document. A field in Aspose.Words
    /// consists of several nodes and it might be difficult to work with all those nodes directly.
    /// Note that this is a simple implementation: it assumes the first node of the field code and
    /// of the field result is a Run, and setting a property collapses that part into a single Run.
    ///
    ///     [FieldStart][Run - field code][FieldSeparator][Run - field result][FieldEnd]
    ///
    /// The field code contains a string in one of these formats:
    ///     HYPERLINK "url"
    ///     HYPERLINK \l "bookmark name"
    ///
    /// The field result contains text that is displayed to the user.
    /// </summary>
    internal class Hyperlink
    {
        internal Hyperlink(FieldStart fieldStart)
        {
            if (fieldStart == null)
                throw new ArgumentNullException("fieldStart");
            if (!fieldStart.FieldType.Equals(FieldType.FieldHyperlink))
                throw new ArgumentException("Field start type must be FieldHyperlink.");

            mFieldStart = fieldStart;

            // Find the field separator node.
            mFieldSeparator = FindNextSibling(mFieldStart, NodeType.FieldSeparator);
            if (mFieldSeparator == null)
                throw new InvalidOperationException("Cannot find field separator.");

            // Find the field end node. Normally the field end will always be found, but in the example document
            // a paragraph break happens to be included in the hyperlink, which puts the field end
            // in the next paragraph. Handling fields that span several paragraphs correctly would be much
            // more complicated, but in this case allowing the field end to be null is enough for our purposes.
            mFieldEnd = FindNextSibling(mFieldSeparator, NodeType.FieldEnd);

            // The field code looks something like [ HYPERLINK "http://www.myurl.com" ], but it can consist of several runs.
            string fieldCode = GetTextSameParent(mFieldStart.NextSibling, mFieldSeparator);
            Match match = gRegex.Match(fieldCode.Trim());
            mIsLocal = (match.Groups[1].Length > 0); // The link is local if \l is present in the field code.
            mTarget = match.Groups[2].Value;
        }

        /// <summary>
        /// Gets or sets the display name of the hyperlink.
        /// </summary>
        internal string Name
        {
            get
            {
                return GetTextSameParent(mFieldSeparator, mFieldEnd);
            }
            set
            {
                // The hyperlink display name is stored in the field result, which is a Run
                // node between the field separator and the field end.
                Run fieldResult = (Run)mFieldSeparator.NextSibling;
                fieldResult.Text = value;

                // Sometimes the field result consists of more than one run; delete the extra runs.
                RemoveSameParent(fieldResult.NextSibling, mFieldEnd);
            }
        }

        /// <summary>
        /// Gets or sets the target URL or bookmark name of the hyperlink.
        /// </summary>
        internal string Target
        {
            get
            {
                return mTarget;
            }
            set
            {
                mTarget = value;
                UpdateFieldCode();
            }
        }

        /// <summary>
        /// True if the hyperlink's target is a bookmark inside the document. False if the hyperlink is a URL.
        /// </summary>
        internal bool IsLocal
        {
            get
            {
                return mIsLocal;
            }
            set
            {
                mIsLocal = value;
                UpdateFieldCode();
            }
        }

        private void UpdateFieldCode()
        {
            // The field code is stored in a Run node between the field start and the field separator.
            Run fieldCode = (Run)mFieldStart.NextSibling;
            fieldCode.Text = string.Format("HYPERLINK {0}\"{1}\"", ((mIsLocal) ? "\\l " : ""), mTarget);

            // Sometimes the field code consists of more than one run; delete the extra runs.
            RemoveSameParent(fieldCode.NextSibling, mFieldSeparator);
        }

        /// <summary>
        /// Goes through siblings starting from the start node until it finds a node of the specified type or null.
        /// </summary>
        private static Node FindNextSibling(Node startNode, NodeType nodeType)
        {
            for (Node node = startNode; node != null; node = node.NextSibling)
            {
                if (node.NodeType.Equals(nodeType))
                    return node;
            }
            return null;
        }

        /// <summary>
        /// Retrieves text from start up to but not including the end node.
        /// The end node may be null (see the constructor), in which case text is read up to the last sibling.
        /// </summary>
        private static string GetTextSameParent(Node startNode, Node endNode)
        {
            if ((endNode != null) && (startNode.ParentNode != endNode.ParentNode))
                throw new ArgumentException("Start and end nodes are expected to have the same parent.");

            StringBuilder builder = new StringBuilder();
            // Stop at the end node, or at the last sibling when the end node is null.
            for (Node child = startNode; (child != null) && (child != endNode); child = child.NextSibling)
                builder.Append(child.GetText());

            return builder.ToString();
        }

        /// <summary>
        /// Removes nodes from start up to but not including the end node.
        /// Start and end are assumed to have the same parent.
        /// </summary>
        private static void RemoveSameParent(Node startNode, Node endNode)
        {
            if ((endNode != null) && (startNode.ParentNode != endNode.ParentNode))
                throw new ArgumentException("Start and end nodes are expected to have the same parent.");

            Node curChild = startNode;
            while ((curChild != null) && (curChild != endNode))
            {
                Node nextChild = curChild.NextSibling;
                curChild.Remove();
                curChild = nextChild;
            }
        }

        private readonly Node mFieldStart;
        private readonly Node mFieldSeparator;
        private readonly Node mFieldEnd;
        private bool mIsLocal;
        private string mTarget;

        /// <summary>
        /// Matches a HYPERLINK field code. Group 1 captures the optional \l switch;
        /// group 2 captures the target (URL or bookmark name).
        /// </summary>
        private static readonly Regex gRegex = new Regex(
            "\\S+" +            // one or more non-space characters: HYPERLINK, or its equivalent in other languages
            "\\s+" +            // one or more whitespace characters
            "(?:\"\"\\s+)?" +   // optional empty "" followed by whitespace (non-capturing); found in one customer's file
            "(\\\\l\\s+)?" +    // optional \l switch followed by whitespace
            "\"" +              // opening double quote
            "([^\"]+)" +        // one or more characters other than a double quote (the hyperlink target)
            "\""                // closing double quote
            );
    }
}
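
The constructor comment in the sample notes that the field end can be missing from the paragraph, so the sibling walks have to tolerate a null end node. The sketch below illustrates that traversal pattern on its own. SimpleNode, Chain, FindNext and TextBetween are hypothetical stand-ins invented for this illustration; they are not Aspose.Words types or methods.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a document node; NOT an Aspose.Words type.
public sealed class SimpleNode
{
    public string Kind;            // e.g. "FieldStart", "Run", "FieldSeparator", "FieldEnd"
    public string Text;
    public SimpleNode NextSibling;

    public SimpleNode(string kind, string text)
    {
        Kind = kind;
        Text = text;
    }
}

public static class SiblingWalkDemo
{
    // Links the nodes in order and returns the first one.
    public static SimpleNode Chain(params SimpleNode[] nodes)
    {
        for (int i = 0; i + 1 < nodes.Length; i++)
            nodes[i].NextSibling = nodes[i + 1];
        return nodes[0];
    }

    // Returns the first node of the given kind at or after start, or null.
    public static SimpleNode FindNext(SimpleNode start, string kind)
    {
        for (SimpleNode node = start; node != null; node = node.NextSibling)
        {
            if (node.Kind == kind)
                return node;
        }
        return null;
    }

    // Collects text from start up to, but not including, end.
    // A null end means "read to the last sibling".
    public static string TextBetween(SimpleNode start, SimpleNode end)
    {
        StringBuilder builder = new StringBuilder();
        for (SimpleNode node = start; node != null && node != end; node = node.NextSibling)
            builder.Append(node.Text);
        return builder.ToString();
    }

    public static void Main()
    {
        // Field result split across two runs; the FieldEnd is not in this sibling list.
        SimpleNode start = Chain(
            new SimpleNode("FieldStart", ""),
            new SimpleNode("Run", " HYPERLINK \"http://www.aspose.com\" "),
            new SimpleNode("FieldSeparator", ""),
            new SimpleNode("Run", "Aspose "),
            new SimpleNode("Run", "home page"));

        SimpleNode separator = FindNext(start, "FieldSeparator");
        SimpleNode end = FindNext(separator, "FieldEnd"); // null in this example

        Console.WriteLine(TextBetween(start.NextSibling, separator));
        Console.WriteLine(TextBetween(separator.NextSibling, end));
    }
}
```

Testing for null before comparing against the end node is what keeps the walk from dereferencing a null sibling when the end node is absent.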

// [Visual Basic Code Sample]


Imports Microsoft.VisualBasic
Imports System
Imports System.Text
Imports System.Text.RegularExpressions
Imports Aspose.Words
Imports Aspose.Words.Fields
Imports NUnit.Framework ' Needed for the TestFixture and Test attributes used below.

Namespace Examples
    ''' <summary>
    ''' Shows how to replace hyperlinks in a Word document.
    ''' ExBase and MyDir are assumed to come from the Aspose.Words examples project;
    ''' MyDir is the folder that holds the sample documents.
    ''' </summary>
    <TestFixture> _
    Public Class ExReplaceHyperlinks
        Inherits ExBase
        ''' <summary>
        ''' Finds all hyperlinks in a Word document and changes their URL and display name.
        ''' </summary>
        <Test> _
        Public Sub ReplaceHyperlinks()
            ' Specify your document name here.
            Dim doc As New Document(MyDir & "ReplaceHyperlinks.doc")

            ' Hyperlinks in a Word document are fields, so select all field start nodes to find the hyperlinks.
            Dim fieldStarts As NodeList = doc.SelectNodes("//FieldStart")
            For Each fieldStart As FieldStart In fieldStarts
                If fieldStart.FieldType.Equals(FieldType.FieldHyperlink) Then
                    ' The field is a hyperlink field, so use the "facade" class to work with it.
                    Dim hyperlink As New Hyperlink(fieldStart)

                    ' Some hyperlinks can be local (links to bookmarks inside the document); ignore these.
                    If hyperlink.IsLocal Then
                        Continue For
                    End If

                    ' The Hyperlink class lets you set the target URL and the display name
                    ' of the link simply by setting its properties.
                    hyperlink.Target = NewUrl
                    hyperlink.Name = NewName
                End If
            Next fieldStart

            doc.Save(MyDir & "ReplaceHyperlinks Out.doc")
        End Sub

        Private Const NewUrl As String = "http://www.aspose.com"
        Private Const NewName As String = "Aspose - The .NET & Java Component Publisher"
    End Class


    ''' <summary>
    ''' This "facade" class makes it easier to work with a hyperlink field in a Word document.
    '''
    ''' A hyperlink is represented by a HYPERLINK field in a Word document. A field in Aspose.Words
    ''' consists of several nodes and it might be difficult to work with all those nodes directly.
    ''' Note that this is a simple implementation: it assumes the first node of the field code and
    ''' of the field result is a Run, and setting a property collapses that part into a single Run.
    '''
    '''     [FieldStart][Run - field code][FieldSeparator][Run - field result][FieldEnd]
    '''
    ''' The field code contains a string in one of these formats:
    '''     HYPERLINK "url"
    '''     HYPERLINK \l "bookmark name"
    '''
    ''' The field result contains text that is displayed to the user.
    ''' </summary>
    Friend Class Hyperlink
        Friend Sub New(ByVal fieldStart As FieldStart)
            If fieldStart Is Nothing Then
                Throw New ArgumentNullException("fieldStart")
            End If
            If (Not fieldStart.FieldType.Equals(FieldType.FieldHyperlink)) Then
                Throw New ArgumentException("Field start type must be FieldHyperlink.")
            End If

            mFieldStart = fieldStart

            ' Find the field separator node.
            mFieldSeparator = FindNextSibling(mFieldStart, NodeType.FieldSeparator)
            If mFieldSeparator Is Nothing Then
                Throw New InvalidOperationException("Cannot find field separator.")
            End If

            ' Find the field end node. Normally the field end will always be found, but in the example document
            ' a paragraph break happens to be included in the hyperlink, which puts the field end
            ' in the next paragraph. Handling fields that span several paragraphs correctly would be much
            ' more complicated, but in this case allowing the field end to be null is enough for our purposes.
            mFieldEnd = FindNextSibling(mFieldSeparator, NodeType.FieldEnd)

            ' The field code looks something like [ HYPERLINK "http://www.myurl.com" ], but it can consist of several runs.
            Dim fieldCode As String = GetTextSameParent(mFieldStart.NextSibling, mFieldSeparator)
            Dim match As Match = gRegex.Match(fieldCode.Trim())
            mIsLocal = (match.Groups(1).Length > 0) ' The link is local if \l is present in the field code.
            mTarget = match.Groups(2).Value
        End Sub

        ''' <summary>
        ''' Gets or sets the display name of the hyperlink.
        ''' </summary>
        Friend Property Name() As String
            Get
                Return GetTextSameParent(mFieldSeparator, mFieldEnd)
            End Get
            Set(ByVal value As String)
                ' The hyperlink display name is stored in the field result, which is a Run
                ' node between the field separator and the field end.
                Dim fieldResult As Run = CType(mFieldSeparator.NextSibling, Run)
                fieldResult.Text = value

                ' Sometimes the field result consists of more than one run; delete the extra runs.
                RemoveSameParent(fieldResult.NextSibling, mFieldEnd)
            End Set
        End Property

        ''' <summary>
        ''' Gets or sets the target URL or bookmark name of the hyperlink.
        ''' </summary>
        Friend Property Target() As String
            Get
                Return mTarget
            End Get
            Set(ByVal value As String)
                mTarget = value
                UpdateFieldCode()
            End Set
        End Property

        ''' <summary>
        ''' True if the hyperlink's target is a bookmark inside the document. False if the hyperlink is a URL.
        ''' </summary>
        Friend Property IsLocal() As Boolean
            Get
                Return mIsLocal
            End Get
            Set(ByVal value As Boolean)
                mIsLocal = value
                UpdateFieldCode()
            End Set
        End Property

        Private Sub UpdateFieldCode()
            ' The field code is stored in a Run node between the field start and the field separator.
            Dim fieldCode As Run = CType(mFieldStart.NextSibling, Run)
            fieldCode.Text = String.Format("HYPERLINK {0}""{1}""", (If((mIsLocal), "\l ", "")), mTarget)

            ' Sometimes the field code consists of more than one run; delete the extra runs.
            RemoveSameParent(fieldCode.NextSibling, mFieldSeparator)
        End Sub

        ''' <summary>
        ''' Goes through siblings starting from the start node until it finds a node of the specified type or null.
        ''' </summary>
        Private Shared Function FindNextSibling(ByVal startNode As Node, ByVal nodeType As NodeType) As Node
            Dim node As Node = startNode
            Do While node IsNot Nothing
                If node.NodeType.Equals(nodeType) Then
                    Return node
                End If
                node = node.NextSibling
            Loop
            Return Nothing
        End Function

        ''' <summary>
        ''' Retrieves text from start up to but not including the end node.
        ''' The end node may be Nothing (see the constructor), in which case text is read up to the last sibling.
        ''' </summary>
        Private Shared Function GetTextSameParent(ByVal startNode As Node, ByVal endNode As Node) As String
            If (endNode IsNot Nothing) AndAlso (startNode.ParentNode IsNot endNode.ParentNode) Then
                Throw New ArgumentException("Start and end nodes are expected to have the same parent.")
            End If

            Dim builder As New StringBuilder()
            Dim child As Node = startNode
            ' Stop at the end node, or at the last sibling when the end node is Nothing.
            Do While (child IsNot Nothing) AndAlso (child IsNot endNode)
                builder.Append(child.GetText())
                child = child.NextSibling
            Loop

            Return builder.ToString()
        End Function

        ''' <summary>
        ''' Removes nodes from start up to but not including the end node.
        ''' Start and end are assumed to have the same parent.
        ''' </summary>
        Private Shared Sub RemoveSameParent(ByVal startNode As Node, ByVal endNode As Node)
            If (endNode IsNot Nothing) AndAlso (startNode.ParentNode IsNot endNode.ParentNode) Then
                Throw New ArgumentException("Start and end nodes are expected to have the same parent.")
            End If

            Dim curChild As Node = startNode
            Do While (curChild IsNot Nothing) AndAlso (curChild IsNot endNode)
                Dim nextChild As Node = curChild.NextSibling
                curChild.Remove()
                curChild = nextChild
            Loop
        End Sub

        Private ReadOnly mFieldStart As Node
        Private ReadOnly mFieldSeparator As Node
        Private ReadOnly mFieldEnd As Node
        Private mIsLocal As Boolean
        Private mTarget As String

        ''' <summary>
        ''' Matches a HYPERLINK field code. Group 1 captures the optional \l switch;
        ''' group 2 captures the target (URL or bookmark name).
        ''' </summary>
        Private Shared ReadOnly gRegex As New Regex("\S+" & "\s+" & "(?:""""\s+)?" & "(\\l\s+)?" & """" & "([^""]+)" & """")
    End Class
End Namespace

More about Aspose.Words for .NET

Aspose.Words is a word processing component that enables Java & .NET applications to read, write and modify Word documents without using Microsoft Word. Other useful features include document creation, content and formatting manipulation, mail merge, reporting, updating and rebuilding of tables of contents (TOC), embedded OOXML, footnote rendering, and support for the DOCX, DOC, WordprocessingML, HTML, XHTML, TXT and PDF formats (PDF requires Aspose.Pdf). It supports both 32-bit and 64-bit operating systems. You can even use Aspose.Words to build applications with Mono.

- Homepage of Aspose.Words for .NET: http://www.aspose.com/.net/word-component.aspx

- Download Aspose.Words for .NET: http://www.aspose.com/community/files/51/.net-components/aspose.words-for-.net/default.aspx


