PowerShell saves the day: Repairing a corrupt Excel XML document

I received a helpdesk ticket today from a user who was having trouble opening an Excel workbook.

The original file was saved with a .XLS extension, but when I attempted to open it, I got this warning:

image

Clicking Yes resulted in this next error, and the document failed to load:

image

Upon inspection, I discovered that the file was actually saved in “XML Data” format. Strange, but apparently this user’s office used to use OpenOffice back in the day, and this document was created back in 2004 if the XML is to be believed.

image

I first tried stripping out just the <Style></Style> portion of the XML, but that then caused errors as most of the cells within the workbook had a ‘ss:Style=’ element. This meant that I had to go and strip out all of these elements, which was next to impossible to do manually.

Here’s how I ended up doing it:

# Path to the corrupted file
$file = "C:temporiginal.xml"

# Read the file in as a System.Xml.XmlDocument
[xml]$content = Get-Content $file

# Remove all styles from the document
$content.Workbook.GetElementsByTagName("Styles").RemoveAll()

# Clear the style attributes from cells, rows, and columns
$content.GetElementsByTagName("Cell") | % {$_.RemoveAttribute("ss:StyleID")}
$content.GetElementsByTagName("Row") | % {$_.RemoveAttribute("ss:StyleID")}
$content.GetElementsByTagName("Column") | % {$_.RemoveAttribute("ss:StyleID")}

# Grab some details about the original file name and where it's stored
$folder = $file | Split-Path -Parent
$fileBaseName = [System.IO.Path]::GetFileNameWithoutExtension($file)
$fileExtension = [System.IO.Path]::GetExtension($file)

# Save a new version of the original .xml file with a "-repaired" suffix
$content.Save("$folder$fileBaseName-repaired$fileExtension")

I’m sure that this isn’t optimal. I tried to find a better way to get all of the elements within the XML document, rather than repeat myself (see lines 11-13), but I ran out of time to play with it.

Regardless of that, the script will strip out all styles, remove the style information from each of the cells/rows/columns, and then save the modified document with an amended filename.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s