Thursday, March 12, 2015

Weblogic OAM Identity Assertion problem - JXDocumentBuilderFactory not found

Environment / assumptions in this post:

  • Weblogic 10.3.1
  • Alfresco 3.4
  • OAM / OID


There is a valuable lesson here, which could be summarized as:
don't configure one middleware home to depend on files in another middleware home

Or even more generally:
don't point to a file or folder that could be changed unexpectedly by something else

Duh.  But these things can come about innocently enough.

Problem Description

We have Alfresco (ECM) deployed to WLS 10.3.1 (yes, very old now!), with OAM Asserter and OID Authenticator in place. At one point, users started experiencing intermittent problems logging in. After authentication, the Alfresco Dashboard would sometimes not come up and an error page would be displayed.

Log files showed this each time:
javax.xml.parsers.FactoryConfigurationError: Provider oracle.xml.jaxp.JXDocumentBuilderFactory not found
   at javax.xml.parsers.DocumentBuilderFactory.newInstance(DocumentBuilderFactory.java:174)
   at oracle.security.wls.oam.util.OAMIdentityAssertion.(OAMIdentityAssertion.java:170)
   at oracle.security.wls.oam.providers.asserter.OAMIdentityAssertionProviderImpl.updateCallbackHandler(OAMIdentityAssertionProviderImpl.java:791)
   at oracle.security.wls.oam.providers.asserter.OAMIdentityAssertionProviderImpl.headerBasedAssertion(OAMIdentityAssertionProviderImpl.java:991)
   at oracle.security.wls.oam.providers.asserter.OAMIdentityAssertionProviderImpl.assertIdentity(OAMIdentityAssertionProviderImpl.java:658)
Truncated. see log file for complete stacktrace

So a couple of questions came up right away:
1. How is it that JXDocumentBuilderFactory can be "not found" intermittently? 
2. What environment change caused this?

These were very hard questions to answer. After any patching or upgrade we thoroughly test all applications and systems, and no problem like this had ever come up.

I had a hunch that the JXDocumentBuilderFactory was a bit of a red herring, so I focused on the next class mentioned in the stack trace: OAMIdentityAssertion. I searched through all the jar files with something like this:

find . -name "*.jar" | while read -r x; do jar tvf "$x" | grep -q OAMIdentityAssertion && echo "$x"; done

...I came across several references, but only in a different middleware home from the one Alfresco was running in. Huh? That's when I found the managed server's JVM startup switch pointing into that other middleware home.
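If you need to check which directory a running managed server actually uses, the switch is visible on the JVM command line. A minimal sketch, assuming a bash shell and a `ps` that supports `-eo args`; the function name `alt_types_dirs` is hypothetical, and directory paths containing spaces are not handled:

```shell
#!/usr/bin/env bash
# Sketch: print each distinct weblogic.alternateTypesDirectory value found in
# the command lines read from stdin. Assumes options are space-separated tokens
# (as in `ps` output) and that the directory path itself contains no spaces.
alt_types_dirs() {
  tr ' ' '\n' \
    | sed -n 's/^-Dweblogic\.alternateTypesDirectory=//p' \
    | sort -u
}

# Usage on a live host: check every running java process.
# ps -eo args | grep '[j]ava' | alt_types_dirs
```

Reading from stdin keeps the parsing separate from `ps`, so the same function can be fed any saved command line as well.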

Problem Solution

We ended up changing the following WLS startup switch:
-Dweblogic.alternateTypesDirectory=.......
...to point to the OAM libraries under a WLS 10.3.5 install rather than the WLS 10.3.6 install it had been pointing to. The 10.3.6 home had been upgraded in place at some point, which is the likely reason for the eventual failures.

Actually, the real root cause is a bit more complicated. We originally installed Alfresco on WLS 10.3.1 using OSSO. When we later migrated to OAM, which was installed in another middleware home, we needed the OAM Asserter, but its code didn't exist in 10.3.1, so we had to point to where it did. We should have properly upgraded the Alfresco WLS instance, but we didn't want to "kick the sleeping dog". Alfresco was running fine, but we had had some issues, and the feeling among management was not to disturb it; we would leave the upgrade for another day.

So a couple lessons learned:
  • if the sleeping dog needs kicking, then kick it. :) In other words, don't put off upgrading too long
  • don't point to something outside of your own "sandbox", even if it is unlikely to change
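The second lesson can be turned into a quick check in a startup or health-check script. This is a sketch, not something we ran in this environment: `is_inside_home` is a hypothetical helper that resolves both directories (following symlinks) and succeeds only when the second one lies within the first.

```shell
#!/usr/bin/env bash
# Sketch: succeed (0) if directory $2 resolves to a path inside directory $1,
# fail (1) if it lies outside, and return 2 if either directory does not exist.
# `cd -P` resolves symlinks, so a link pointing into another home is caught too.
is_inside_home() {
  local home dir
  home=$(cd -P -- "$1" 2>/dev/null && pwd) || return 2
  dir=$(cd -P -- "$2" 2>/dev/null && pwd) || return 2
  home=${home%/}   # so that a home of "/" still works as a prefix
  case "$dir/" in
    "$home"/*) return 0 ;;   # same directory or a subdirectory
    *)         return 1 ;;   # outside the home (including /mw vs /mw2)
  esac
}

# Usage (MW_HOME and ALT_TYPES_DIR are placeholders for your own values):
# is_inside_home "$MW_HOME" "$ALT_TYPES_DIR" \
#   || echo "WARNING: alternateTypesDirectory points outside $MW_HOME" >&2
```

Comparing against `"$dir/"` with a trailing slash avoids the classic prefix bug where a sibling home such as `/u01/mw2` would otherwise look like it lives inside `/u01/mw`.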