
Author Archives: Sean McElroy

About Sean McElroy

I am a project and technical resource manager with a wealth of technical expertise, including application development, enterprise software architecture, database architecture and design, capacity planning, and change control. I have specific knowledge of online, hosted .NET applications in an ASP/SaaS business model, user experience analysis and design, and the integration of platforms using service-oriented architecture built on Microsoft technologies.

A Brief Introduction to Part-of-Speech Tagging

A field of computer science that has captured my attention lately is computational linguistics — the inexact science of how to get a computer to understand what you mean. This could be something as futuristic as Matthew Broderick's battle with the WOPR, or something more practical, like Siri. Whether it is text typed at a keyboard or something more akin to understanding the very unstructured format of human speech, understanding the meaning behind parsed words is incredibly complex — and to someone like me — fascinating!

My particular interest of late is parsing — which, from a linguistic perspective, means breaking a string of characters into words, determining their meanings, and stringing them together in a parse tree, where the meanings of individual words as well as the relationships between them are composed into a logical construct that allows higher-order functions, such as a personal assistant. Having taken several foreign language classes, and having later sat on the other side of the table as an ESL teacher, I can appreciate the enormous ambiguity and complexity of any language — English especially among the Germanic languages — when it comes to creating an automated process that parses input into meaningful logical representations. Just discerning the meaning of individual words, given the multitude of meanings that can be ascribed to any one sequence of characters, is quite a challenge.

Parsing Models

Consider this:  My security beat wore me out tonight.

In this sentence, what is the function of the word beat? Beat can function as either a noun or a verb, but in this context it is a noun. There are two general schools of thought on assigning a tag for the part of speech (POS) each word in a sentence serves: iterative rules-based methods and stochastic methods. Rules-based methods, like Eric Brill's POS tagger, use a priority-ordered set of rules that set forth language-specific axioms, such as "when a word appears to be a preposition, it is actually a noun if the preceding word is while". A complex set of these meticulously constructed conditions is used to refine a coarser dictionary-style assignment of POS tags.
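To make the rules-based approach concrete, here is a minimal sketch in Python; the tiny lexicon and the single contextual rule are invented for this example and are not Brill's actual rules.

# A toy Brill-style tagger: a coarse dictionary lookup first, then
# priority-ordered contextual rules that correct the initial guesses.
LEXICON = {
    "my": "PRP$", "security": "NN", "beat": "VB",   # "beat" defaults to verb
    "wore": "VBD", "me": "PRP", "out": "RP", "tonight": "NN",
}

# Each rule: (from_tag, to_tag, condition on the neighboring tags).
RULES = [
    # "A word tagged as a verb is retagged as a noun if it follows a noun or possessive."
    ("VB", "NN", lambda prev, nxt: prev in ("NN", "PRP$")),
]

def tag(words):
    tags = [LEXICON.get(w.lower(), "NN") for w in words]    # coarse first pass
    for i, current in enumerate(tags):
        prev = tags[i - 1] if i > 0 else None
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        for from_tag, to_tag, condition in RULES:            # rule-based refinement
            if current == from_tag and condition(prev, nxt):
                tags[i] = to_tag
    return list(zip(words, tags))

print(tag("My security beat wore me out tonight".split()))
# "beat" flips from VB to NN because it follows the noun "security".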

Stochastic methods, by contrast, are "fuzzier": they build statistical models of how words should be tagged, not through a procedural, manual analysis of edge cases and their mitigations, but by training over pre-tagged corpora, in a manner hearkening to the training sets applied to neural networks. These trained models are then used as a baseline for assigning tags to incoming text, but there is no notable option for correcting a specific error or edge case other than retraining the entire model. One very interesting concept is treating part-of-speech tagging as a Hidden Markov Model — a probabilistic model of a process whose internal states cannot be observed directly and must be inferred from its inputs and outputs.
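As a sketch of the stochastic idea, here is a tiny Viterbi decoder over a two-tag Hidden Markov Model; the transition and emission probabilities below are invented for illustration, not trained from any corpus.

# Tags are the hidden states, words are the observations, and Viterbi
# recovers the most probable tag sequence under the model.
states = ["NN", "VB"]
start_p = {"NN": 0.6, "VB": 0.4}
trans_p = {"NN": {"NN": 0.7, "VB": 0.3}, "VB": {"NN": 0.6, "VB": 0.4}}
emit_p = {
    "NN": {"security": 0.05, "beat": 0.02},
    "VB": {"security": 0.001, "beat": 0.03},
}

def viterbi(words):
    # Each cell holds (probability of the best path ending in this state, that path).
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-6), [s]) for s in states}]
    for w in words[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][ps][0] * trans_p[ps][s] * emit_p[s].get(w, 1e-6), V[-1][ps][1])
                for ps in states
            )
            row[s] = (prob, path + [s])
        V.append(row)
    return max(V[-1].values())[1]

print(viterbi(["security", "beat"]))   # ['NN', 'NN'] with these made-up numbers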

This continues to be a good candidate for doctoral theses in computer science disciplines — papers that have caused me to lose too much sleep as of late.

Parsing Syntax

Even describing parts of speech can be as mundane as your elementary school grammar book, or as rich as the C7 tagset, which provides 146 unique ways to describe a word's potential function. While that is exceptionally expressive and specific, I have become rather fond of the Penn Treebank II tagset, which defines 45 tags that seem to provide enough semantic context for the key elements of local pronoun resolution and larger-scale object-entity context mapping. Finding an extensively tagged Penn Treebank corpus proves difficult, however: it is copyrighted by the University of Pennsylvania, distributed through a public-private partnership for several thousand dollars, and the tagged corpus covers an almost exclusively narrow variety of topics and sentence structures — Wall Street Journal articles. Obtaining it is critical as a reference check for writing a new Penn Treebank II part-of-speech tagger, and the restriction prevents the construction of a more comprehensive Penn-tagged wordlist, which would be a boon for any tagger implementation. However, the folks at NLTK have provided a 10% free sample under Fair Use that has proved somewhat useful, both for checking outputs in a limited fashion and for generating some more useful relative statistics about relationships between parts of speech within a sentence.

To produce some rudimentary probabilistic models to guide ambiguous POS mappings for individual words, I wrote a five-minute proof of concept that scanned the NLTK-provided excerpt of the WSJ Penn Treebank corpus to produce probabilities of what the next word's part of speech would be given the previous word's tag. The full results are available in this gist.
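For anyone who wants to reproduce the idea, a rough sketch along the same lines follows. It assumes the NLTK treebank sample has been downloaded (nltk.download('treebank')); its numbers will not match the gist exactly.

# Count tag-to-tag transitions over the 10% WSJ sample that ships with NLTK,
# then turn the counts into P(next tag | previous tag).
from collections import Counter, defaultdict
from nltk.corpus import treebank

transitions = defaultdict(Counter)
for sent in treebank.tagged_sents():
    tags = ["<S>"] + [tag for _, tag in sent]        # sentence-start marker
    for prev, cur in zip(tags, tags[1:]):
        transitions[prev][cur] += 1

def next_tag_probs(prev_tag, top=5):
    counts = transitions[prev_tag]
    total = sum(counts.values())
    return [(t, round(c / total, 3)) for t, c in counts.most_common(top)]

print(next_tag_probs("DT"))    # after a determiner, nouns and adjectives should dominate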

Future Musings

My immediate interest, whenever I get some free time on a weekend (which is pretty rare these days due to the exceptional pace of progress at our start-up), is pronoun resolution, which is the object of this generation's Turing Test — the Winograd Schemas. An example of such a challenge is to get a machine to answer this kind of question: Joe's uncle can still beat him at tennis, even though he is 30 years older. Who is older? This kind of question is easy for a human to answer, but very, very hard for a machine to infer, because (a) it can't cheat by Googling a suitable answer, as some of the less impressive Turing Test contestant programs now do, and (b) it requires not only the ability to successfully parse a sentence into its respective parts of speech, phrases, and clauses, but also the ability for a computer to resolve the meaning of a pronoun. That's an insanely tough feat! Imagine this:

“Annabelle is a mean-spirited person.  She shot my dog out of spite.”

A program could infer "my dog" is a dog belonging to the person providing the text. That has obvious real-world applications if you can do it, and it has been done before. But imagine the leap in context that is exponentially harder to overcome when resolving "She". This requires not only an intra-sentence relationship of noun phrases, possessive pronouns, direct objects, and adverbial clauses, but also the ability to carry context forward from one sentence to the next, building a growing "mental map" of people, places, and things — and building a profile of them as more information or context is provided. And if you think that's not hard enough, imagine two additional words appended to this sentence:

, she said.

To a human, that would indicate dialog, which requires a wholly separate, Inception-style frame of reference between contextual frames. The parser is reading text about things that is actually being conveyed by other things — both sets of frames have their own unique, but not necessarily separate, domains and attributes. I'm a very long way off from ever getting this "free time" diversion anywhere close to functioning as advertised… but then again, that's what exercises on a weekend are for — not doing, but learning. 🙂

 

Posted on August 22, 2013 in Programming

 

Robustness in Programming

(For my regular readers, I know I promised this post would detail 'a method by which anyone could send me a message securely, without knowing anything else about me other than my e-mail address, in a way I could read online or on my mobile device, and in a way that no one can subpoena or snoop on in between.' A tall order, for sure, but still something I am working to complete in an RFC format. In the meantime…)

I have the benefit of supporting an engineering group that is seeing tremendous change and growth, well past ideation and proof of concept and into the validation and scaling phases of a product timeline. One observation I've made amid the many lessons taught and learned as part of this company and product growth spurt has been the misapplication of Jon Postel's Robustness Principle. Many technical folks are at least familiar with, and often can quote, the adage: "Be conservative in what you do, be liberal in what you accept from others". Unfortunately, like many good pieces of advice, this is taken out of context when it is applied to software development.

First off, robustness, while it sounds positive, is not a trait you always want. This can be confusing for the uninitiated, considering antonyms of the word include "unfitness" and "weakness". On a macro scale, you want a system to be robust; you want a product to be robust. However, if you decompose an enterprise software solution into its components, and those pieces into their individual parts, the concerns do not always need to be, and in some cases should not be, robust.

For instance, should a security audit log be robust? Imagine a highly secure software application that must carefully log each access attempt to the system. This system is probably designed so that many different components can write data to the log, and imagine the logging system is simple and writes its output to a file. If this particular part of the system were robust, as many developers define it, it must attempt, as well as possible, to accept and log any message posted to it. Implemented this way, however, it is subject to CRLF injection attacks, whereby a component that can connect to it can insert a delimiter that allows it to add false entries to the security log. Of course, you developers say, you need to do input checking and not allow such a condition to pass through to the log. I would go much further and state you must be as meticulous as possible about parsing, and throw exceptions or raise errors for as many conditions as possible. Each exception that is not thrown is an implicit assumption, and assumptions are the root cause of 9 out of the OWASP Top 10 vulnerabilities in web applications.

Robustness can be, and often is, an excuse predicated on laziness. Thinking about edge cases and about the assumptions software developers make with each method they write is tedious. It is time consuming. It does not advance a user story along its path in an iteration. It adds no movement towards delivering functionality to your end users. Recognizing and mitigating your incorrect assumptions, however, is an undocumented but critical requirement for the development of every piece of a system that stores, or may ever come in contact with, protected information. Those that rely on the Robustness Principle must not interpret "liberal" to mean "passive" or "permissive", but rather "extensible".

In the example logging system I posited, consider how such a system could remove assumptions but still be extensible. The number and format of each argument that comprises a log entry should be carefully inspected: if auditing text must be descriptive, shouldn't such a system reject a zero- or two-character event description? While information systems should be localizable and multilingual, shouldn't all logs be written in one language, with any characters outside that language omitted and unique system identifiers in the log language's character set used instead? If various elements are co-related, such as an account number and a username, shouldn't they be checked for an association instead of blindly accepted as stated by the caller? If the log should be chronological, shouldn't an event specified in the future, or too far in the past, be rejected? Each of these leading questions exposes a vulnerability that a careful assessment of input checking can address, but which is wholly against most developers' interpretations of the Robustness Principle.
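As a sketch of what those checks might look like in practice (the field names, limits, and the ASCII-only description rule are illustrative choices, not a prescription, and the directory lookup is a hypothetical trusted source):

# Reject early and loudly: every check below removes an implicit assumption
# the logging component would otherwise be making about its callers.
import re
from datetime import datetime, timedelta, timezone

DESCRIPTION = re.compile(r"^[A-Za-z0-9 ,.\-]{3,200}$")    # no CR/LF or control characters
MAX_CLOCK_SKEW = timedelta(minutes=5)

def write_audit_entry(log, when, account_id, username, description, directory):
    # 'when' is expected to be a timezone-aware UTC datetime supplied by the caller.
    now = datetime.now(timezone.utc)
    if not (now - timedelta(days=1) <= when <= now + MAX_CLOCK_SKEW):
        raise ValueError("timestamp outside the acceptable window")
    if not DESCRIPTION.match(description):
        raise ValueError("description empty, too long, or contains disallowed characters")
    # Co-related fields are verified against a trusted source, not taken from the caller.
    if directory.username_for(account_id) != username:
        raise ValueError("account number and username do not match")
    log.write(f"{when.isoformat()}\t{account_id}\t{username}\t{description}\n")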

However, robustness is not about taking whatever is given to you; it is about very carefully checking what you get, and only if it passes a litany of qualifying checks, accepting it as an answer to an open-ended question, rather than relying on a defined set of responses, when possible. A junior developer might enumerate all the error states he or she can imagine in a set list or "enum", and only accept those values as valid input to a method. While that is a form of input checking, it is wholly inextensible, as the next error state any other contributor wishes to add will require a recompile and redeploy of the logging piece, and potentially of every other consumer of that component. Robustness need not require that all data be free-form; it must simply be written with foresight.
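By contrast, extensibility without passivity might look like this sketch: accept event codes that follow an agreed naming convention rather than a compiled-in enum, so new components can add codes without redeploying the logger. The DOMAIN.EVENT convention shown is an invented example.

import re

# Any well-formed DOMAIN.EVENT code is acceptable; anything else is rejected loudly.
EVENT_CODE = re.compile(r"^[A-Z][A-Z0-9_]{1,30}\.[A-Z][A-Z0-9_]{1,30}$")

def validate_event_code(code: str) -> str:
    if not EVENT_CODE.match(code):
        raise ValueError(f"event code {code!r} violates the naming convention")
    return code

validate_event_code("AUTH.LOGIN_FAILED")    # a new component's code, accepted without a recompile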

Postel wrote his "law" with reference to TCP implementations, but he never suggested that TCP stack implementers accept segments with such boundless blitheness that they infer the syntax of whatever bits they receive; rather, they should not impose an understanding of data elements that are not pertinent to the task at hand, nor enforce one specific interpretation of a specification upon upstream callers. And therein lies my second point — robustness is not about disregarding syntax, but about imposing a convention. Robust systems must fail as early and as quickly as possible when syntax, especially, has been violated or cannot be accurately and unambiguously interpreted, or when the context or state of a system is invalid for the operation. For instance, if a component receives a syntactically valid message but can determine the context is wrong, such as a request for information from a user who lacks authorization to that data, every conceivable permutation of invalid context should be checked — not glossed over in a blasé fashion to leave room for a future feature that may, someday, require an assumption made in the present, if it is ever developed. That crosses another threshold, beyond extensibility into culpable disregard.

In conclusion, building a robust system requires discretion in interpreting programming "laws" and "axioms", and an expert realization that no one-liner assertion was meant by its author as a principle so general as to apply to every level of technical scale in the architecture and design of a system. To those who would disagree with me, I would say: to be "robust" yourself, you have to accept my argument. 😉

 

Posted on August 7, 2013 in Programming

 

When All You See Are Clouds… A Storm Is Brewing

The recent disclosures that the United States Government has violated the 4th Amendment of the U.S. Constitution, and potentially other international law, by building a clandestine program that gives G-Men at the NSA direct taps into every aspect of our digital life — our e-mail, our photos, our phone calls, our entire relationships with other people and even with our spouses — are quite concerning from a technology policy perspective. The fact that the US Government (USG) can, by legal authority, usurp any part of our recorded life — which is about every moment of our day — highlights several important points to consider:

  1. Putting the issue of whether the USG/NSA should have broad access into our lives aside, we must accept that the loopholes that allow them to demand this access expose weaknesses in our technology.
  2. The fact the USG can perform this type of surveillance indicates other foreign governments and non-government organizations likely can and may already be doing so as well.
  3. Given that governments are often less technologically savvy, though much more resource-rich, than malevolent actors, if data is not secure from government access, it is most definitely not secure from more cunning hackers, identity thieves, and other criminal enterprises.

If we can accept the points above, then we must accept that the disclosure of PRISM, and what is connoted by the carefully but awkwardly worded public statements about the program, present both a problem and an opportunity for technologists regarding data security in today's age. This is not a debate about whether we have anything to hide, but rather a discussion of how we can secure data, because if we cannot secure it from a coercive power (sovereign or criminal), we have no real data security at all.

But before proposing some solutions, we must consider:

How Could PRISM Have Happened in the First Place?

I posit an answer devoid of politics or blame, based instead on an evaluation of the present state of Internet connectivity and e-commerce. Arguably, the Internet has matured into a stable, reliable set of services. The more exciting phase of its development saw a flourishing of ideas, much like a digital Cambrian explosion. In its awkward adolescence, connecting to the Internet was akin to performing a complicated rain dance that involved WinSock, dial-up modems, and PPP, sprinkled with roadblocks like busy signals, routine server downtime, and blue screens of death. The rate of change in equipment, protocols, and software was meteoric, and while the World Wide Web existed (what most laypeople consider wholly as "the Internet" today), it was only a small fraction of the myriad services and channels through which information flowed. Connecting to and using the Internet required highly specialized knowledge, which both increased the level of expertise of those developing for and consuming the Internet and limited its adoption and appeal — a fact some consider the net's Golden Age.

But as with all complex technologies, maturity eventually arrives. The rate of innovation slows as standardization, pushed by market forces, becomes the driving technological force. As less popular protocols and methods of exchanging information gave way to the preferred technologies of young but profitable enterprises, the Internet became a much more homogeneous experience, both in how we connect to it and in how we interact with it. This shaped not only the fate of now-obsolete tech, such as UUCP, FINGER, ARCHIE, GOPHER, and a slew of other relics of our digital past, but also influenced the very design of what remains — a great example being identification and encryption.

For the Internet to become a commercializable venue, securing access to money — from online banking to investment portfolio management to payments — was an essential hurdle to overcome. The solution to the general problem of identity and encryption, centralized SSL certificate authorities providing assurances of trust in a top-down manner, solves the problem specifically for the webmasters of central servers, but not for end-users wishing to enjoy the same access to identity management and encryption technology. So while beneficiaries like Amazon, eBay, PayPal, and company now had a solution that assured their users that their websites belonged to them and that data exchanged with them was secure, end-users were still left with no ability to control secure communications or identify themselves to each other.

A final contributing factor I want to point out: as other protocols drifted into oblivion, more functionality was demanded over a more uniform channel — the de facto winner being HTTP and the web. Originally a stateless protocol designed for minimal browsing features, the web became a solution for virtually everything, from e-mail ("webmail"), to searching, to file storage (who has even fired up an FTP client in the last year?). This was a big win for service providers, as they, like Yahoo! and later Google, could build entire product suites on just one delivery platform, HTTP, but it was also a big win for consumers, who could throw away all their odd little programs that performed specific tasks and just use their web browser for everything — now even Grandma can get involved. A rich offering of single-purpose tech companies was bought up or died out in favor of the oligarchs we know today — Microsoft, Facebook, Google, Twitter, and the like.

Subtly, this also represented a huge shift in where data is stored. Remember Eudora, or your Outlook inbox file tied to your computer (in the days of POP3, before IMAP was around)? As our web browser became our interface to the online world, and as we demanded anywhere-accessibility to those services and the data they create or consume, those bits moved off our hard drives and into the nebulous service provider cloud, where data security cannot be guaranteed.

This is meaningful to consider in the context of today’s problem because:

  1. Governments and corporate enterprises were historically unable to sufficiently regulate, censor, or monitor the Internet because they lacked the tools and knowledge to do so. Thus, the Internet had security through obscurity.
  2. Because the solutions to the general problems of identity and encryption rely on central authorities, malefactors (unscrupulous governments and hackers alike) have fewer targets to influence or assert control over in order to tap into the nature of trust, identity, and communications.
  3. With the collapse of service providers into a handful of powerful actors, on a scale of inequity on par with the collapse of wealth distribution in America, there now exist fewer providers to surveil to gather data, and those providers host more data on each person or business, which can be interrelated in a more meaningful way.
  4. As information infrastructure technology has matured to provide virtual servers and IaaS offerings on a massive scale, fewer users and companies deploy controlled devices and servers, opting instead to lease services from cloud providers or use devices, like smartphones, that wholly depend upon them.
  5. Because data has migrated off our local storage devices to the cloud, end-users have lost control over their data's security. Users have to choose between an outmoded, device-specific way to access their data, or giving up control to cloud service providers.

There Is A Better Way

Over the next few blog posts, I am going to delve into a number of proposals and thoughts around giving control and security assurances of data back to end-users. These will address points #2 and #4 above, as solutions that layer over existing web technologies, not proposals to upend our fundamental usage of the Internet by introducing opaque configuration barriers or whole new paradigms. End-users should have a choice about whether their service providers have access to their data, in a way that does not require Freenet's darknets or Tor's game-of-telephone style of anonymous but slow onion routing. Rather, users should be able to positively identify themselves to the world, and to send, receive, and access data in a cloud-based application, without ever having to give up their data security, without having to trust the service provider, while remaining able to access the same service securely from any device, and without having to establish shared secrets (swapping passwords or certificates).

As a good example, if you want to send a secure e-mail message today, you have three categorical options to do so:

  1. Implicitly trust a regular service provider:  Ensure both the sender and the receiver use the same server.  By sending a message, it is only at risk while the sender connects to the provider to store it and while the receiver connects to the provider to retrieve it.  Both parties trust the service provider will not access or share the information.  Of course, many actors, like Gmail, still do.
  2. Use a secure webmail provider:  These providers, like Voltage.com, encrypt the sender’s connection to the service to protect the message as it is sent, and send notifications to receivers to come to a secure HTTPS site to view the message.  While better than the first option, the message is still stored in a way that can be demanded by subpoena or snooped inside the company while it sits on their servers.
  3. Use S/MIME certificates and an offline mail client:  While the most secure option for end-to-end message encryption, this cumbersome method is machine-dependent and requires senders and receivers to first share a certificate with each other – something the average user is flatly incapable of understanding or configuring.

Stay tuned for my next post, where I propose a method by which anyone could send me a message securely, without knowing anything else about me other than my e-mail address, in a way I could read online or on my mobile device, and in a way that no one can subpoena or snoop on in between.

 

 
 


Doing Your Due Diligence on Security Scanning and Penetration Testing Vendors

All too often, development shops and IT professionals become complacent, depending on packaged scanning solutions or a utility belt of tools to provide security assurance testing of a hosted software solution. In the past five years, a number of new entrants to the security evaluation and penetration testing market have created some compelling cloud-based solutions for perimeter testing. These tools, while exceptionally useful for a sanity check of firewall rules, load balancer configurations, and even certain industry best practices in web application development, are starting to create a false sense of security in a number of ways. As these tools proliferate, infrastructure professionals are becoming increasingly dependent upon their handsomely crafted reporting about PCI, GLBA, SOX, HIPAA, and all the other regulatory buzzwords that apply to certain industries. If you're using these tools, have you considered:

Do you use more than one tool? If not, you should — and if you do, is there any actual overlap between their testing criteria?

There is a certain incestuous phenomenon that develops in any SaaS industry that sees high profit margins: entrepreneurs perceive cloud-based solutions as having a low barrier to entry.  This perception drives new market entrants to cobble together solutions to compete for share in the space.  But are these fly-by-night competitors competitively differentiated from their peers?

Sadly, I have found in practical experience this is not the case. Too many times have I enrolled in a free trial of a tool, or actually shelled out for some hot new cloud-based scanning solution, only to find that, at best, existing known vulnerabilities are duplicatively reported by this new 'solution', with only false positives appearing as the 'net new' items brought to my attention. Herein lies the rub — when new entrants to this market create competing products, there is an iterative reverse engineering that goes on: they run existing scanning products on the market against websites, check those results, and make sure they develop a solution that at least identifies the same issues.

That’s not good at all.  In any given security scan, you may see, perhaps, 20% of the total vulnerabilities a product is capable of finding show up as a problem in a scan target.  Even if you were to scan multiple targets, you may only be seeing mostly the same kinds of issues in each subsequent scan.  Those using this as a methodology to build quick-to-market security scanning solutions are delivering sub-par offerings that may only identify 70% of the vulnerabilities other scanning solutions do.  eEye has put together similar findings in an intriguing report I highly recommend reading.  Investigating the research and development activities of a security scanning provider is an important due diligence step to make sure when you get an “all clear” clean report from a scanning tool, that report actually means something.

How do you judge your security vendor in this regard?  Ask for a listing of all specific vulnerabilities they scan for.  Excellent players in this market will not flinch at giving you this kind of data for two reasons: (1) a list of what they check for isn’t as important as how well and how thoroughly they actually assess each item, and (2) worthwhile vendors are constantly adding new items to the list, so it doesn’t represent any static master blueprint for their product.

Does your tool test more than OWASP vulnerabilities?

The problem with developing security testing tools is, in part, an over-reliance on the standardization of vulnerability definitions and classifications. While it is helpful to categorize vulnerabilities into conceptually similar groups to create common mitigation strategies and techniques, too often security vendors focus on OWASP attack classifications as the definitive scope for probative activities. Don't get me wrong, these are excellent guides for ensuring the most common types of attacks are covered, but they do not provide a comprehensive test of application security. Too often, types of testing such as incremental information disclosure — where various pieces of the system provide information that can be used to discern how to attack the system further — are relegated to manual penetration testing instead of codified into scanning criteria. Path disclosure and path traversal vulnerabilities are a class of incremental information disclosure that is routinely tested for by scanning tools, but they represent only a file-system-based test for this kind of security problem rather than part of a larger, systematic approach to it.

Moreover, SaaS providers should consider DoS/DDoS weaknesses as security problems, not just customer relationship or business continuity problems.  These types of attacks can cripple a provider and draw their technical talent to the problem at hand, mitigating the denial of service attack.  During those periods, this can and has recently been used in high-profile fake-outs to either generate so much trash traffic that other attacks and penetrations are difficult to perceive or react to, or to create opportunities for social engineering attacks to succeed with less sophisticated personnel while the big-guns are trying to tackle the bigger attacks.  Until weaknesses that can allow for high-load to easily take down a SaaS application are included as part of vulnerability scanning, this will remain a serious hole in the testing methodology of a security scanning vendor.

So, seeing CVE identifiers and OWASP classifications for reported items is nice from a reporting perspective, and it gives a certain credence to mitigation reports to auditors, but don’t let those lull you into a false sense of security coverage.  Ask your vendor what other types of weaknesses and application vulnerabilities they test for outside of the prescribed standard vulnerability classifications.  Otherwise, you will potentially shield yourself from “script kiddies”, but leave yourself open to targeted attacks and advanced persistent threats that have created embarrassing situations for a number of large institutions in the past year.

What is your mobile strategy?

Native mobile applications are the hot stuff right now. Purists tout the HTML5-only route to mobile application development, but mobile web development alone hasn't been enough to satisfy Apple for access to the iOS platform since 2008, and consumers can still detect a web app that is merely a browser window; they prefer the feature set that comes with native applications, including camera access, accelerometer data, and the use of the physical phone buttons in application navigation. The native experience is still too nice to pass up if you want to be at the head of the class in your industry.

If you’re a serious player in the SaaS market, you have or will soon have a native mobile application or hybrid-native deliverable. If you’re like most other software development shops, mobile isn’t your forte, but you’ve probably hired specific talent with a mobile skill set to realize whatever your native strategy is.  Are your architects and in-house security professionals giving the same critical eye to native architecture, development, and code review as they are to your web offering?  If you’re honest, the answer is: probably not.

The reason your answer is 'probably not' is that it is a whole different technology stack, set of development languages, and testing methodology, where the tools you invested in to secure your web application do not apply to your native application development. This doesn't mean your native applications are not vulnerable; it means they're vulnerable in different ways that you aren't even aware of or testing for yet. This should be a wake-up call for enterprise software shops: the fact that a vulnerability exists only on a native platform does not mitigate its seriousness. It is trivial to spin up a mobile emulator to host a native application and use the power of a desktop or server to exploit that vulnerability on a scale that could cripple a business through disclosure or denial of service.

Your native mobile security scanning strategy should minimally cover two important surface areas:

1. Vulnerabilities in the way the application stores data on the device in memory and on any removable media

2. Vulnerabilities in the underlying API serving the native application

If you’re not considering these, then you probably have not selected a native application security scanning tool checking for these either.

In Conclusion

Security is always a moving target, as fluid as the adaptiveness of attackers' techniques and the rapid pace of change in the technologies they attack. Don't treat security scanning and penetration testing as a checklist item for RFPs or a way to address auditors' concerns — understand the surface areas, and understand the failings of security vendors' products. Understand that your assessments are valid only in the short term, and that re-evaluation of your vendor mix and their offerings on a continual basis is crucial. Only then will you be informed and able to make the right decisions to be proactive, instead of reactive, about the sustainability of your business.

 

Posted on May 29, 2013 in Security

 

Thwarting SSL Inspection Proxies

A disturbing trend in corporate IT departments everywhere is the introduction of SSL inspection proxies.  This blog post explores some of the ethical concerns about such proxies and proposes a provider-side technology solution to allow clients to detect their presence and alert end-users.  If you’re well-versed in concepts about HTTPS, SSL/TLS, and PKI, please skip down to the section entitled ‘Proposal’.

For starters, e-commerce and many other uses of the public Internet are only possible because the capability to encrypt messages exists. The encryption of information across the World Wide Web is possible through a suite of cryptography technologies and practices known as Public Key Infrastructure (PKI). Using PKI, servers can offer a "secure" variant of the HTTP protocol, abbreviated as HTTPS. This variant encapsulates application-level protocols, like HTTP, using a transport-layer protocol called Secure Sockets Layer (SSL), which has since been superseded by a similar, more secure version, Transport Layer Security (TLS). Most users of the Internet are familiar with the symbolism common to such secure connections: when a user browses a webpage over HTTPS, some visual iconography (usually a padlock) as well as a stark change in the presentation of the page's location (usually a green indicator) show the end-user that the page was transmitted over HTTPS.

SSL/TLS connections are protected in part by a server certificate stored on the web server.  Website operators purchase these server certificates from a small number of competing companies, called Certificate Authorities (CA’s), that can generate them.  The web browsers we all use are preconfigured to trust certificates that are “signed” by a CA.  The way certificates work in PKI allows certain certificates to sign, or vouch for, other certificates.  For example, when you visit Facebook.com, you see your connection is secure, and if you inspect the message, you can see the server certificate Facebook presents is trusted because it is signed by VeriSign, and VeriSign is a CA that your browser trusts to sign certificates.

So… what is an SSL Inspection Proxy? Well, there is a long history of employers and other entities using technology to conduct surveillance of the networks they own. Most workplace Internet Acceptable Use Policies state clearly that use of the Internet on company-owned machines and company-paid bandwidth is permitted only for business purposes, and that the company reserves the right to enforce this policy by monitoring that use. While employers can easily review and log all unencrypted traffic that flows over their networks — that is, any request for a webpage and the returned rendered output — the increasing prevalence of HTTPS as a default has frustrated employers in recent years. Instead of being able to easily monitor the traffic that traverses their networks, they have had to resort to less specific ways to infer usage of secure sites, such as DNS recording.

(For those unaware and curious, the domain-name system (DNS) allows client computers to resolve a URL’s name, such as Yahoo.com, to its IP address, 72.30.38.140.  DNS traffic is not encrypted, so a network operator can review the requests of any computers to translate these names to IP addresses to infer where they are going.  This is a poor way to survey user activity, however, because many applications and web browsers do something called “DNS pre-caching”, where they will look up name-to-number translations in advance to quickly service user requests, even if the user hasn’t visited the site before.  For instance, if I visited a page that had a link to Playboy.com, even if I never click the link, Google Chrome may look up that IP address translation just in case I ever do in order to look up the page faster.)

So, employers and other network operators are turning to technologies that are ethically questionable, such as Deep Packet Inspection (DPI), which looks into all the application traffic you send to determine what you might be doing, and to the downright unethical practice of using SSL Inspection Proxies. Now, I concede I have an opinion here: SSL Inspection Proxies are evil. I justify that assertion because an SSL Inspection Proxy causes your web browser to lie to its end-user, giving them a false assertion of security.

What exactly are SSL Inspection Proxies? They are servers set up to execute a Man-In-The-Middle (MITM) attack on a secure connection, on behalf of your ISP or corporate IT department snoops. When such a proxy exists on your network and you make a secure request for https://www.google.com, the network redirects your request to the proxy. The proxy then makes a request to https://www.google.com for you, returns the results, and then does something very dirty — it creates a lie in the form of a bogus server certificate. The proxy creates a false certificate for https://www.google.com, signs it with a different CA it has in its software, and hands the response back. This "lie" happens in two ways:

  1. The proxy presents itself as the server you requested, instead of the actual server.
  2. The certificate handed back with the page response is different from the one actually presented by that provider, https://www.google.com in this case.

This interchange works as follows: the browser's secure request for http://www.example.com is intercepted by the proxy, which brokers the request to the real server while handing the browser a forged certificate of its own.

It sounds strange to describe the activities of your own network as an "attack", but this type of interaction is precisely that, and it is widely known in the network security industry as a MITM attack. A different certificate is handed back to the end-user's browser than the one actually presented by http://www.example.com. Why? Well, each server certificate that is presented with a response is used to encrypt that data. Server certificates have what is called a "public key", which everyone knows and which uniquely identifies the certificate, and they also have a "private key", known in this example only by the web server. A public key can be used to encrypt information, but only the private key can decrypt it. Without an SSL Inspection Proxy — that is, in the normal case — when you make a request to http://www.example.com, the server first sends back the public key of its server certificate to your browser. Your browser uses that public key to encrypt the request for a specific webpage, as well as a 'password' of sorts, and sends that back to http://www.example.com. The server then uses its private key to decrypt the request, processes it, and uses that 'password' (called a session key) to send back an encrypted response. That doesn't work so well for an inspection proxy, because this SSL/TLS interchange is designed to thwart any interloper from being able to intercept or see the data transmitted back and forth.
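To make the asymmetry concrete, here is a toy sketch using the third-party Python cryptography package: the server holds a private key, the browser wraps a session key with the matching public key, and only the private-key holder can unwrap it. Real TLS key exchange is considerably more involved; this only illustrates why an eavesdropper without the private key cannot recover the session key.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.fernet import Fernet

# Server side: the key pair behind the server certificate.
server_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
server_public = server_private.public_key()

# Browser side: invent a session 'password' and encrypt it with the public key.
session_key = Fernet.generate_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped = server_public.encrypt(session_key, oaep)

# Server side: only the private key can unwrap the session key...
unwrapped = server_private.decrypt(wrapped, oaep)
# ...which is then used for fast symmetric encryption of the actual response.
response = Fernet(unwrapped).encrypt(b"<html>the requested page</html>")
print(Fernet(session_key).decrypt(response))    # the browser can read the reply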

The reason an SSL Inspection Proxy sends a different certificate back is so it can see the request the end-user's browser is making, so it knows what to pass on to the actual server as it injects itself into this interchange. Otherwise, once the request came to the proxy, the proxy could not read it, because the proxy wouldn't have http://www.example.com's private key. So, instead, it generates a public/private key pair and makes it appear to be http://www.example.com's server certificate so it can act on the server's behalf, and then uses the actual public key of the real server certificate to broker the request onward.

Proposal

The reason an SSL Inspection Proxy can even work is that it signs the fake certificate it creates on-the-fly using a CA certificate trusted by the end-user's browser. This, sadly, could be a legitimate certificate (called a SubCA certificate), which would allow anyone who purchases a SubCA certificate to create any server certificate they wanted, and it would appear valid to the end-user's browser. Why? A SubCA certificate is like a regular server certificate, except it can also be used to sign OTHER certificates. Any system that trusts the CA that created and signed the SubCA certificate would also trust any certificate the SubCA signs. Because the SubCA certificate is signed by, let's say, the DigiNotar CA, and your web browser is preconfigured to trust that CA, your browser would accept a forged certificate for http://www.example.com signed by the SubCA. Thankfully, SubCAs are frowned upon and increasingly difficult for any organization to obtain, because they present a real and present danger to the entire certificate-based security ecosystem.

However, as long as the MITM attacker (or your corporate IT department, in the case of an SSL Inspection Proxy) can coerce your browser to trust the CA used by the proxy, the proxy can create all the false certificates it wants, sign them with that coerced-in CA certificate, and most users would never notice the difference. All the same visual elements of a secure connection — the green coloration, the padlock icon, and any other indicators made by the browser — would be present. My proposal to thwart this:

Website operators should publish a hash of the public key of their server certificate (the certificate thumbprint) as a DNS record. For DNS top-level domains (TLDs) protected with DNSSEC, as long as the DNS record containing the hash for http://www.example.com is cryptographically signed, neither the corporate IT department of local clients nor a network operator could forge a certificate without creating a verifiable breach that clients could check for and warn end users about. Of course, browsers would need to be updated to do this kind of verification in the form of a DNS lookup in conjunction with the TLS handshake, but provided their resolvers checked for an additional certificate-thumbprint DNS record anyway, this would be a relatively trivial enhancement to make.
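A sketch of what that client-side check could look like follows. It uses the third-party dnspython package; the _certhash TXT record name is an invented convention for illustration, and DNSSEC validation of the answer is omitted for brevity.

import hashlib
import ssl
import dns.resolver

def presented_cert_sha256(host, port=443):
    pem = ssl.get_server_certificate((host, port))      # certificate seen on the wire
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()

def published_cert_sha256(host):
    answer = dns.resolver.resolve(f"_certhash.{host}", "TXT")
    return answer[0].strings[0].decode().lower()

def connection_is_untampered(host):
    # A mismatch suggests something between you and the server swapped the
    # certificate -- exactly what an SSL Inspection Proxy does.
    return presented_cert_sha256(host) == published_cert_sha256(host)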

EDIT (April 15, 2013): There is in fact an IETF working group now addressing this, very close to my original proposal! Check out the work of the DNS-based Authentication of Named Entities (DANE) group here: http://datatracker.ietf.org/wg/dane/ — on February 25, they published a working draft of this proposed resolution as the new "TLSA" record. Great minds think alike. 🙂

 

Posted on September 15, 2012 in Ethical Concerns, Open Standards, Privacy, Security

 


CNN Lies to Every One of Its Web Viewers

When is it okay to flat out lie to your users?  I would argue: Never.  But the website of one of the world’s most watched sources of news, CNN, does just that.

Near the bottom of every article is a section called “We recommend” and “From around the web”.  These sections list about six links to other articles either on CNN itself, other Turner properties, or simply as a paid referral service for selected partners.  So what’s my beef with this?  It’s not the targeted marketing, it’s the outright lie I noticed they make when you hover over any of those links with your mouse.

For some background, I'm a huge dissident against outbound link tracking. It's fundamentally the same as gluing a GPS tracking device to your forehead and giving a tracking device to the website you're visiting. I have a problem with it because I think there is a fundamental freedom that is eroded by this technology — the freedom to consume information without being tracked for doing so. Do I have the right to pick up a magazine and browse through it without giving someone my telephone number? I would say yes — I think it is a natural right to be able to consume information without having your consumption observed.

But my belief here isn't realistic — tracking basic visitor behavior and consumer preferences is the basic monetization and sustainability model for most of the Web as we know it. So, this world doesn't mesh with my perfect world, but at least I should know when someone is observing my behavior, right? Reading CNN's privacy policy, one can clearly see the word "link" is referenced twice: once in relation to third-party sites that may cookie you, and once for integration with social media or other partner sites that may have differing privacy policies.

Okay, fair enough: if I surf just CNN's website, disable cookies, and turn on my Do Not Track header, I should expect not to be tracked, right? No, and the reason is that I cannot tell when I'm still on the CNN site in order to stay within it. CNN has specifically coded its site to lie to me about whether I'm staying within it or navigating away. For example, if I hover over one example link in these two sections, I see the following in my browser status bar:

http://www.cnn.com/2012/07/15/sport/jason-kidd-arrested/index.html

I right-clicked the link in Chrome and copied the URL.  Then curiously I noticed the link read differently in the browser status bar when hovering over it, this time reading:

http://traffic.outbrain.com/network/redir?key=ad68e2a0a57f3eb04e4553bf2e80b6b2&rdid=349349184&type=MVLVS_d/t1_ch&in-site=false&req_id=968ab83e0a0f44e584d8744520d2aea0&agent=blog_JS_rec&recMode=4&reqType=1&wid=100&imgType=0&refPub=0&prs=true&scp=false&version=59070&idx=3

Youch, what’s that, and why did it change?  On closer inspection, by viewing the source of the page, I can see the target href of the link is exactly as reproduced above, going to traffic.outbrain.com.  I peeked at some other URL’s in the same section that I had not yet left-clicked or right-clicked and noticed this:

<a target="_self" href="http://www.cnn.com/2012/07/15/sport/jason-kidd-arrested/index.html" onmousedown="this.href='http://traffic.outbrain.com/network/redir?key=10b8398e7c07227c8a8786b1682f1707&amp;rdid=349349184&amp;type=WMV_d/t1_ch&amp;in-site=false&amp;req_id=968ab83e0a0f44e584d8744520d2aea0&amp;agent=blog_JS_rec&amp;recMode=4&amp;reqType=1&amp;wid=100&amp;imgType=0&amp;refPub=0&amp;prs=true&amp;scp=false&amp;version=59070&amp;idx=4';return true;" onclick="javascript:return(true)">Knicks’ Jason Kidd arrested on suspicion of DWI</a>

And herein is the deception — this piece of inline JavaScript code changes the target of the link at the moment it is clicked so that it goes to the traffic.outbrain.com address. Because the target href originally reads as the final destination of the article, hovering over it gives the false impression that my click will take me directly there. Instead, at the moment I click it, the target href is changed to the potentially unscrupulous third party, I am given no browser notification this would happen prior to my click, and when traffic.outbrain.com responds, it redirects me back to the original CNN article I initially wanted to view. On a broadband connection, you probably wouldn't even notice the superfluous page load and redirect back to CNN's site. Deceptive!

So, sure, why should anyone care? Isn't this just the plumbing, technology, and toolbox of tricks inherent to the Web? Maybe, but the problem here is the lie. You do not lie to your users. Ever. Outbound link tracking is not a web beacon. Web beacons are a different kind of "evil" — usually some JavaScript that opens an IFRAME to a third-party site that issues a cookie to track you; however, web beacons are covered by CNN's privacy policy, so if the two were equivalent, it would all be fair. Web beacons can be simply disabled by turning off third-party cookies in today's browsers. This is precisely why outbound link tracking is becoming popular — it circumvents the privacy management tools most users have available and know about. Outbound link tracking is no more insidious than web beacons are, but the implementation of it often lies to the end-user about what their action (a click, in this case) will do. An honest implementation would either clearly state in the privacy policy that any link you click may be tracked, or simply not deceive the user: point the target href at the link-tracking site outright, so the browser status bar is truthful on hover (Twitter's t.co strategy), rather than rewriting it at the moment of the click.

Well, at least it's just CNN at fault here. At least no one else would stoop to such shady tactics. Surely not Google (/url) or Facebook (l.php)… no, definitely not…

 
 


The Cost of Speed

First off, I’m quite dissatisfied with my work.

But then again, isn't every architect? No matter how fantastically we break down and lay out complex enterprise systems, there's always something to be dissatisfied with, even in the best logical designs, physical hardware, business logic, and user experiences. We know well enough that enterprise software development is never complete. Sure, user stories and discrete tasks can be marked "complete" in an issue tracking system, but large enterprise systems are virtual organisms that can be endlessly extended, refined, and improved upon. There is no finish line, but rather a multidimensional cube of gradients where each metric of success is defined and measured by different stakeholders. So, when I state I'm dissatisfied with my work, that's not a state of being; it's an acknowledgement that architecting and developing these systems is a continuum of satisficing stakeholders, not a process that is ever truly complete. We should be dissatisfied, because if we are not, we are complacent.

Measurements of Success

However, just because the composition of large and complex systems has no discrete end doesn't mean success cannot be measured. There are a ton of metrics that can be derived for various parties in an ISV and the client ecosystem, some of which have meaning, and some of which can be predictors of success. When I look at a system, I intrinsically think about the technical metrics first — the layers of indirection, query costs, how chatty an interface is, cyclomatic complexity, interface definitions, the segregation of responsibility, patterns that are reusable and durable from one set of developers to the next, et cetera. But architects must understand that while these metrics do play a role in the ultimate success, re-usability, and appeal of a solution, they are not the metrics a business user — usually someone who defines success at a more meaningful level for the going concern of a sustainable business — will consider. Instead, these technical metrics contribute to other metrics that are the ultimate way in which a product's success will be measured and judged. Specifically, there are only three things that executive offices, sales, and prospects care about:

  1. What does the system do?  (What are the features and benefits?)
  2. What does the system look like when it does it?  (What’s the visual user experience?)
  3. How fast does the system do it?

Note that absent from that list is a metric worded like "How does the system do it?" Inevitably the 'how' question is part of large Requests For Proposal (RFPs), but in my experience, at the end of the day, those questions are mere pass-fail criteria that rarely play into an actual purchase decision or a contract renewal decision. Quite often both junior and senior developers, and many times even management, fail to keep this in perspective. If a solution can demonstrate what it does — and what it does is what a customer needs it to do — that it does it in a pleasing way, and that it does it fast, users are satisfied.

That last item, "How fast does the system do it?", seems out of place, doesn't it? Now, any whiny sales guy (I used to work with a lot of them; thankfully we have an awesome team where I'm at now) can tell you how a sluggish-feeling web page can tank a demo, or blame a two-second render time for the bacon he didn't bring home last quarter, and cloistered developers are used to brushing off those comments. They really shouldn't. Speed directly determines the success of a product in three ways:

Users who have a slow experience are less likely to start to use the product

KISSmetrics put together a fantastic infographic on this subject that shows how page abandonment is affected by web page load times.

And let's not fool ourselves — just because your product is served on an intranet, not for the fickle consumption of the B2C public Internet, your users are no less fickle or demanding. Nor are you immune to this phenomenon because you utilize native clients or rich internet applications (RIAs) to provide your product or service. Users will abandon your way of accessing their data if it's too slow, even if you think they are a captive audience. For instance, in a world where data liberation is a real and powerful force — where users demand to export their data from your system to use the interface of their choice, or even worse, where users demand you provide APIs to your data so they can use your competitor's user interface — no audience is captive. Even worse for those of you providing a B2C public Internet service, page load times play into search engine optimization (SEO) ranking algorithms, meaning a slow site is less likely to even enter the consciousness of prospects who depend on a search engine to scope their perception of available services.

Users who have a slow experience are less likely to continue using a product

Let's say you've enticed users with all your wonderful functionality and a slick Web 2.0 (I hate that term, for the record) user interface to visit your site, perhaps even sign up and take it for a spin. Most developers fail to realize that a clunky browsing experience in an application doesn't just temporarily frustrate users; it affects their psychological perceptions about the credibility of your product (Fogg et al. 2001) as well as the quality of the service (Bouch, Kuchinsky, and Bhatti 2000). In one case, which analyzed a large data set from an e-commerce site, a one-second delay in page loads reduced customer conversion rates by 7%.

The above graphic is a visualization of a behavior model by BJ Fogg of Stanford University about how users' motivation and ability create a threshold to take action, and what triggers a product can use to entice users to cross that threshold depending on their position along this action boundary. Truly fascinating stuff, but to distill it down into the context of this blog post — the marketing of your product and the value proposition of your service should be creating high motivation for your end users. What a shame, then, if users never take action to use your product because you failed to reduce barriers to usage — reducing their ability and increasing complexity because your site was sluggish. Crossing that boundary is one hurdle, but ISVs have the ability to move the boundary itself in the way they market, design, and implement the product.

The Cost-Speed Curvature

Okay, okay, you got it, right?  The product needs to be fast.  But how fast is fast enough?  You can find studies from the late 1990's that say 8-10 seconds is the gold standard.  Back in reality, though, expectations today are closer to the 2-3 second threshold.  The wiggle room in this minuscule window is extraordinarily small: it makes no allowance for the slow rendering speeds of ancient computers or low-powered mobile devices that might be using your site, for the client's low bandwidth, or for buffer bloat in every piece of equipment between your server's network card and your end user's.  Not to mention, most sites aren't simply delivering static, cacheable content.  They're hitting farms of web servers behind load balancers, often using a separate caching instance, subject to the disk and network I/O of a database server and any number of components in between executing potentially long-running processes, all of which has to happen in a manner that still provides the perception of a speedy user experience.
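To see how little room that 2-3 second window really leaves, it helps to write the budget down.  The components and millisecond allocations in this sketch are illustrative assumptions; your stack will have different line items and different numbers:

# A rough latency "budget" against a 2.5-second perceived page-load target.
# Both the component list and the millisecond allocations are assumptions.

PAGE_LOAD_TARGET_MS = 2500

latency_budget_ms = {
    "DNS lookup + TCP/TLS handshake":   200,
    "Load balancer + web farm routing":  50,
    "Application server processing":    600,
    "Cache / database round trips":     700,
    "Response transfer over the wire":  350,
    "Browser parse + render":           500,
}

spent = sum(latency_budget_ms.values())
print(f"Budgeted {spent} ms of {PAGE_LOAD_TARGET_MS} ms "
      f"({PAGE_LOAD_TARGET_MS - spent} ms of headroom)")

for component, ms in sorted(latency_budget_ms.items(), key=lambda kv: -kv[1]):
    print(f"  {component:<34} {ms:>4} ms  ({ms / PAGE_LOAD_TARGET_MS:.0%} of target)")

Notice how much of the window is consumed by pieces you don't directly control before your own code even runs.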

Now, exactly how to get your product or service faster isn't my concern, and it's highly dependent on exactly what you do and exactly how you do it, meaning your technology stack and specific infrastructure decisions.  What I can tell you, though, is that you need an answer for your executive suite, board, or ultimately impatient user who, no matter how performant (or not) your system is, asks, "How can we make this faster?"  This answer shouldn't be quantitative, as in, "We can shave 4 seconds off if we do Enhancement X, which will take two weeks", unless you want to hear your words parroted back to you when you can't deliver on such an unrealistic expectation.  Even if you have an amazing amount of profiled data points about each component of your system, quantifying improvements up front is a mental exercise with little predictive value in enterprise solutions.

Why?

Well, in any serious enterprise software solution, there is obviously code you didn't write and pieces you didn't architect.  Even if you were Employee #1, and didn't inherit a mess from a predecessor team or architect, inevitably you're using multiple black boxes in your interconnected system in the form of code libraries.  Even if you're a big FOSS proponent and could technically look at the source code for those libraries, face it: in a real business you will never have the time to do so, even if you have the nerdy interest.  While you can sample the inputs and outputs of each of those closed systems, you can predict, but you cannot quantify, how changing an input will affect the performance of the closed system producing an output.  Don't try it; you will fail.

Instead, remember my opening paragraph: performance optimization, much like "feature completeness", is not a goal, it is a process that continues over the life of the product.  Obviously, developers start this process by Googling StackOverflow et al. for "slow IOC startup" or "IIS memory issues in WCF services" or whatever the issue is with your particular technology stack, and reviewing the "me too" comments to see if they, too, made a "me too" misconfiguration or misdesign.  Maybe it's "whoops, forgot to turn on web server GZIP compression" or "whoops, forgot to turn off debug symbols when I compile".  Typically, these are low-hanging fruit: low risk to effect a change, with a high potential impact.  But eventually you run out of simple "whoops!" Eureka moments and answers to simple questions, and you end up having to ask harder questions that have fewer obvious answers, requiring time spent specifically on researching those answers and developing solutions in-house.  When you think about it, there's a real escalating cost for each unit of performance gain over the lifetime of the product for this very reason.  Graphed as a curve, I'll call it the Marginal Cost of Speed:
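If you want to play with the shape of that curve, here is a toy model of it.  It assumes just two things: there is a floor latency you can never actually reach, and cumulative cost grows without bound as you approach it.  The constants are purely illustrative:

# A toy model of the Marginal Cost of Speed curve. The constants are illustrative;
# only the shape matters: cost diverges as load time approaches a floor.

FLOOR_SECONDS = 0.5   # assumed: a latency you can never get below
COST_CONSTANT = 10.0  # assumed: scales "cost" into arbitrary engineer-units

def cumulative_cost(load_time_seconds: float) -> float:
    """Total (toy) cost to drive page loads down to the given time."""
    return COST_CONSTANT / (load_time_seconds - FLOOR_SECONDS)

previous = None
for target in (8.0, 6.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.75, 0.6):
    cost = cumulative_cost(target)
    marginal = "" if previous is None else f"  (+{cost - previous:.1f} to get here)"
    print(f"{target:.2f}s load time -> cumulative cost {cost:6.1f}{marginal}")
    previous = cost

The first few seconds you shave are nearly free; the last few hundred milliseconds cost more than everything that came before them.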

And this is, in fact, a reality that must be thoroughly understood inside a development team all the way up through the executive suite.  Not dissimilar to how Einstein postulated that accelerating a mass to the speed of light would require infinite energy, the only way to get an instant page load or a zero-latency back-end process is to spend an infinite amount of resources achieving that goal.  I say this has to be understood at the development team level mostly because you will never, no matter how pragmatic and persuasive you are, convince the executive suite or the customer that you in fact cannot repeat the last thing you did that doubled performance, because the further you go down the performance optimization road, the narrower the road gets and the longer the distance between mile markers.  The development team needs to fully understand what constitutes low-hanging fruit and must focus its efforts on those simple changes that effect the greatest improvement first, and not tackle such problems with an instinctive impulse to refactor.
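One way to keep that focus honest is to make the prioritization mechanical: rank candidate fixes by estimated impact per unit of effort and work the list from the top.  Every entry and estimate in this sketch is hypothetical; the real numbers come out of your own profiling and planning sessions:

# Rank candidate optimizations by estimated payoff per unit of effort.
# All descriptions and estimates below are hypothetical placeholders.

candidate_fixes = [
    # (description,                              est. ms saved, est. effort in days)
    ("Enable GZIP compression on the web server",  400,  0.5),
    ("Compile without debug symbols",              150,  0.5),
    ("Add output caching for the dashboard page",  600,  3.0),
    ("Denormalize the reporting query",            900, 10.0),
    ("Rewrite the rendering layer",               1200, 45.0),
]

for description, ms_saved, effort_days in sorted(
        candidate_fixes, key=lambda fix: fix[1] / fix[2], reverse=True):
    print(f"{ms_saved / effort_days:7.0f} ms saved per day of effort  <- {description}")

The refactor everyone itches to do usually lands at the bottom of that list.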

Likewise, the executive and marketing teams need to understand that developing a lightning-fast product is a last-mile problem: reaching that nirvana requires an increasing amount of time (cost) and resources (cost).  The effort is an exercise in satisficing, balancing the parameters to find an acceptable middle ground.  Usually, though, the realities of product development aren't treated the same as the realities of other externally governed factors, simply because they are perceived not to be governed by any absolutes since they are not external.  Put another way, customers of Amazon.com might abandon the site because shipping times for purchases are too long, but the company can't just start comp'ing overnight service for everyone.  Well, they could, but the cost to acquire those customers would skyrocket to a level that makes the business model unsustainable.  Similarly, the time spent on performance optimization has a real and measurable cost, and it can actually be quantified as part of the cost to acquire and retain a customer when you consider how a performant site directly impacts customer acquisition and retention.  Put in those terms, the business folks can definitely understand it.  But they'll still want it faster anyway.
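Here is a hedged sketch of what that quantification might look like.  Every input is a made-up illustrative number; the point is only that both the engineering spend on speed and the conversions it wins back show up in the cost to acquire a customer:

# Folding performance work into customer acquisition cost (CAC).
# All inputs are illustrative assumptions.

marketing_spend = 120_000.0         # assumed quarterly marketing dollars
performance_engineering = 40_000.0  # assumed quarterly spend on speed work
signups_if_site_stays_slow = 1_000  # assumed conversions without the speed work
signups_after_speed_work = 1_400    # assumed conversions once the speed work lands

cac_without = marketing_spend / signups_if_site_stays_slow
cac_with = (marketing_spend + performance_engineering) / signups_after_speed_work

print(f"CAC without the performance work:   ${cac_without:,.2f}")
print(f"CAC including the performance work: ${cac_with:,.2f}")

In this particular illustration the speed work pays for itself; with your numbers it might not, and that is exactly the trade-off the suits have to weigh.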

Where To Sit

So, where do you then sit on that curve?  The real answer is, it doesn't really matter how much you do or don't want to make performance optimizations, particularly if they're approaching the infinite-cost asymptote of that graph.  The answer is, you have to sit wherever your competitors sit.  Most of us out there building the next great thing aren't making markets; we're creating displacement products.  For those of us doing so, we've got to chase wherever our most successful competitor sits on the Marginal Cost of Speed graph.  Now, to be fair, those guys have probably been working for a few years on their ascent up that cost-performance climb, and they probably have deeper pockets and more slack time than you do if you're breaking into a market, but there is a trade-off the suits can make.  The accumulated cost of the first 90% of that curve is less than the cost of the last 10% alone.  Put another way, if you can be performant enough to satisfy 90% of the prospects who are 100% happy with your competitor's product, that may well be enough to displace enough business to let you keep tackling that last mile another day.
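Reusing the toy cost model from earlier, you can see just how lopsided that 90/10 split is.  The starting latency and the competitor's latency here are illustrative assumptions:

# Compare the cumulative cost of matching 90% of a competitor's latency
# improvement against the cost of closing the last 10%. Same toy model as before.

FLOOR_SECONDS = 0.5
COST_CONSTANT = 10.0

def cumulative_cost(load_time_seconds: float) -> float:
    return COST_CONSTANT / (load_time_seconds - FLOOR_SECONDS)

starting_latency = 8.0     # assumed: where a naive first release lands
competitor_latency = 0.75  # assumed: where the incumbent sits today

total_improvement = starting_latency - competitor_latency
ninety_percent_point = starting_latency - 0.9 * total_improvement

cost_to_90 = cumulative_cost(ninety_percent_point)
cost_to_100 = cumulative_cost(competitor_latency)

print(f"90% of the improvement lands you at {ninety_percent_point:.2f}s "
      f"for a cumulative cost of {cost_to_90:.1f}")
print(f"The last 10% alone costs another {cost_to_100 - cost_to_90:.1f} "
      f"(total {cost_to_100:.1f})")

Under these made-up numbers, the first 90% of the journey costs roughly a third of what the last 10% does.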

Obviously, this question can't be completely answered that way, because it's highly dependent on your specific markets.  Are you entering a market with a democratic offering of grass-roots, home-grown alternatives, or are you tackling an oligarchic industry?  Are you targeting disparate customers, or are your customers banded together in trade associations (which translates to: how much does your reputation change with each success or each failure)?  How easily can your customers back out of a contract if they find performance or other factors don't match the vision sold to them?  The answers may mean that "how fast does it need to be" requires a disproportionately higher amount of resources and time to reach a good, marketable value proposition.

In summary, you never really should sit anywhere on that curve; you should be climbing it.  It will cost you more the further you climb, but you should never feel like you're done optimizing performance, and you should never stop reviewing it.  Remember how I mentioned most of us are in the displacement business?  Even if you're not, someday someone else will be, looking to displace you.  That guy might be me someday, and rest assured, I won't rest assured anywhere. 🙂

 

Posted by on May 2, 2012 in User Experience