Thoughts on passing the GIAC Security Essentials (GSEC)
Today I passed the GIAC Security Essentials Certification, also known as the GSEC. I passed with a 95% on my first certification attempt, so I thought it might be useful to decompose my thoughts on this one for any who attempt it in the future.
My background is technical - I started my career in software engineering and database performance tuning, moved into engineering leadership roles, and eventually ended up pursuing my interests in cybersecurity, where I have been a CISO at two financial services firms. Yet, I’m still very hands-on during the day, and I recently wrote a QUIC userspace implementation to learn the spec in the evenings. I have previously earned the CISSP and CISM certifications, although these are more leadership and risk management focused credentials that don’t speak much to technical aptitude as it relates to security. For that reason, as well as my personal desire to keep my technical skills sharp while also working at the executive level and leading a team, I decided to apply to, and was accepted into, the SANS Technology Institute’s (STI) Master of Science in Information Security Engineering (MSISE) program. The first stop along the MSISE journey is the GSEC.
As part of the MSISE program, I pay tuition for a graduate class to gain access to SANS training and the associated GIAC exam, which provides me with a grade for my course. This was my very first SANS training and my first GIAC exam. I had the option to challenge the exam directly since I had recently earned my CISSP, but my STI student advisor kindly recommended I take in the full training experience. I was admittedly reluctant, both because I feel I am pretty strong technically and because it would have been slightly cheaper and faster for me to just go straight to the GSEC exam, but the advice was well founded.
The SANS SEC 401 class by Dr. Eric Cole was outstanding. Dr. Cole’s presentation style feels genuine and engaging over the self-paced OnDemand modality I chose. I walked into this content with the preconceived notion that much of this would be review for me, and honestly, a lot of it was. This isn’t to state the course is remedial, simply that as a builder of security programs, the concepts and advice aren’t new to me, but some of the technical pieces were. I learned new and useful tools as part of this course, and I could see this as an excellent foundational course for current and aspiring security team members in any organization. Finding high-quality training content is exceptionally valuable to me in my day job, and of course it was valuable to me personally in taking this course.
As other GIAC alumni who have recounted their experiences will tell you, developing good indexing skills is critical because the GSEC is an open-book exam. I followed Josh Armentrout’s index format, and I walked into the exam with about 4 pages of indexes I developed throughout the course. Admittedly, the way I learn best is by reading, so I spent my time in SEC 401’s OnDemand video with Dr. Cole on 2x speed, scanning pages in the book as I went along for index-worthy concepts or terms. I did not spend any time highlighting the books or listening to MP3s; I just focused on the audio and what I was reading. I would finish a ‘day’ at 2x in about 2 nights of my time, devoting about 5 hours a night for a couple of weeks to get through it all with a worthwhile index. There are no specific tips or tricks to the content - the course syllabus plainly states what will be covered, and that’s the reality of what OnDemand provided. I will say read your entire book. Sometimes key concepts have interesting nuances that end on the back of a page in a trailing paragraph. Don’t skip those.
With my course, in addition to the self-study quizzes in the OnDemand portal - which test the content of SEC401, not the GSEC - I received two GIAC practice tests and the final GIAC exam test ready to schedule. While everything in the OnDemand portal is self-paced, repeatable, and not timed (other than the overall subscription access), the GIAC practice tests are delivered in the same format as the exam - timed, but they also provide explanations for any incorrectly answered questions. The MSISE program has a learning community portal where generous souls who do not use both GIAC practice tests give away their tests to others who want extra shots. While that’s awfully nice of them, and I was tempted to do the same, I found value in taking both practice tests to test and refine the quality of my index. I’m glad I did, and would suggest never to give away a practice test if you feel you could use it to benefit your index or your comprehension of the breadth of the training topics. (Hey, you paid for these practice tests, so you come first.) I took my first practice test as an ‘open internet’ variant where I would quickly Google something to answer the test, but then make sure my notes were fully fleshed out from what external sources could add. My last practice test was ‘closed internet, open book’ to mimic the actual exam experience, and this was a last test of my index for completeness, since that’s all I would have on the test day. Obviously, I carefully read the explanations to anything I answered incorrectly and tuned my notes and did additional readings to make sure I did not repeat any misfires.
Finally, exam day came today! I’m no stranger to these types of tests or Pearson Vue, so the experience was predictable and suitable. It is interesting walking into a Pearson Vue with an armful of books since most exams they test for allow no notes or books. I came in with all six course books, the lab workbook, the network quick reference guide, my index, and a separate page of notes I made about common ports and protocols that were not on the network quick reference guide but were mentioned elsewhere in the course material. I used everything I brought in, if only to take the exam at a ‘leisurely’ pace and spend adequate time double checking my answers.
Unlike the CISSP or CISM, which are based on practical experience (with the exception of the CISSP’s strange obsession with fire suppression controls…), the GSEC was much more knowledge-based, specifically on the SEC401 training materials. So, the right answer is less likely to come from things you already know (come on, you don’t really know ALL those nmap switches), but from what you have learned and can recall or find. Arguably, this is a bit more realistic, as aren’t all technical folks somewhat dependent on their navigation of StackOverflow or Google-fu? :)
It’s hard to know from the outside whether SEC401 is custom tailored to the GSEC, or whether the GSEC is really testing SEC401, but they fit together like pieces of a puzzle. Answers to questions often came nearly verbatim from the slides, or more often, the narrative, in the SEC401 books I had in tow. That’s not a knock on the SANS content or the GIAC exam - I call this out simply to advise those studying for the GSEC to intimately know the SEC401 material as it is presented in the books. Treat the high-quality OnDemand video as a wonderful supplement, but don’t go light on your reading and indexing of your spiral bound friends. Also, do the labs, and repeat them until you could recognize a screenshot of output to a tool you covered in the curriculum or in a lab. If you couldn’t recognize a screenshot or command well enough by sight, you probably aren’t soaking in the technical material at the level you need to demonstrate competency at the higher end of the spectrum.
This process got me from an 89% on my cavalier run through the first practice test, to a 92% on my second practice test, to a 95% on exam day. There are really no tricks to doing well on the GSEC, nor tricks the exam will try to play on you. It is plainly written, very technical, and you would be a fool not to be prepared with the associated SANS training and a well-crafted index before sitting down to make an attempt. (Check out Lesley Carhart’s great post on studying and indexing too, if you have not already.) Even if you might think ‘I know all this’, you probably don’t have the GSEC cinched unless you give it serious attention and a good study.
I hope this helps someone out there!
Despite DoH and ESNI, with OCSP, web activity is insecure and not private
TL;DR
Certificate Transparency (CT) logs increasingly allow virtually every TLS certificate to be identified by serial number. Since OCSP responses are unencrypted and contain the certificate's serial number, which can be looked up in CT logs, as well as unsalted hashes of the certificate's Distinguished Name and public key, these can easily be profiled to compromise the privacy of clients even in the presence of DoH and ESNI privacy protections.
Background
A lot of great work has happened over the past few years in securing the web by strengthening encryption and improving user security indicators. This helps users make informed decisions to keep their online activity secure and private and to thwart network adversaries from profiling users. Man-in-the-middle attacks on the network often conjure images of someone breaking into a server room and installing some kind of interlocutor spyware device or splicing a network card. Repeatedly, though, the internet service providers that bring the Internet to consumers' homes have demonstrated they will use their privileged position on the network to sell private information about consumer internet use or degrade services from competitors.
Policy fixes like network neutrality are still in play, but these threats aren’t unlikely one-offs that target individuals; they are systemic abuses by technology providers. Technology fixes, though, are seeking to limit the visibility of web activity, making details such as the names of websites one visits or the content one downloads indiscernible to anyone except the requester and the actual website operator.
Progress
Significant strides in improving the strength of encryption that makes data in transit unreadable, such as TLS 1.3, have squelched vulnerabilities that stem from aging cryptographic algorithms and ciphers, as well as certain threats to the confidentiality of communications when an encryption key is leaked or a nation-state attacker is involved. However, metadata exchanged in the process of finding a server and securely establishing a connection, namely DNS queries and the TLS Server Name Indication (SNI), can still leak. This poses both an existential privacy problem that is particularly troubling to vulnerable populations under repressive regimes and a method for sophisticated technology providers in 'free' societies to profile traffic for bandwidth discrimination, censorship, or profiteering.
A couple of standards have gained traction to address these weaknesses in DNS and TLS, with proposals termed DNS over HTTPS (DoH) and encrypted SNI (ESNI), respectively.
DoH
Traditional DNS is a plaintext game of 'telephone' whereby a client's request to resolve a hostname into an IP address may traverse many different servers operated by many different entities to look up and return the answer. DoH moves this communication from an unencrypted channel to an encrypted one, which still requires one to trust the privacy policy of the entity servicing the request, but does not need to presume the good behavior of every intermediate network and DNS server in the mix. This is a very good thing, and we will see much wider adoption roll out in the next few years.
ESNI
ESNI is a proposal to plug a hole in an extension of the Transport Layer Security protocol (sometimes incorrectly referred to by its obsolete predecessor, SSL) which allows for encrypted communications to happen over a channel in a standard way for many applications. In the web's early days, users would connect to a web server, such as Yahoo.com, and Yahoo.com would return a signed certificate that could be used to set up a secure communications channel.
However, as the web matured, methods for hosting many different sites on the same server or set of servers took off and there was no longer a 1:1 match for a domain name and a web server. SNI was an extension that lets a client, like a web browser, specify “I want Yahoo.com” so the web site provider could return the correct, unique certificate to set up the channel for Yahoo.com, even though it could also be serving lots of other sites too. However, the “I want Yahoo.com” is exchanged in plaintext before the certificate is provided and before an encrypted channel is established.
That means savvy technology providers could just look here, instead of logging DNS requests, for similar data on the host names to which a customer is attempting to connect. This is becoming far more viable as HTTPS Everywhere, user agent changes, and free certificate authorities like Let's Encrypt are making ‘secure by default’ the new reality for the web. More TLS means more encryption, but also more consistency in finding hostnames in SNI fields.
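To make the plaintext exposure concrete, here is a minimal Python sketch of how a hostname sits inside the SNI extension and how a passive observer could read it back. It handles only the extension bytes, not a full ClientHello, and the function names are hypothetical ones chosen for illustration.

```python
import struct


def build_sni_extension(hostname: str) -> bytes:
    """Encode a server_name extension roughly as a client would send it."""
    name = hostname.encode("ascii")
    # ServerName entry: name_type (0 = host_name), 2-byte length, then the name
    entry = struct.pack("!BH", 0, len(name)) + name
    # ServerNameList: 2-byte length prefix followed by the entries
    server_name_list = struct.pack("!H", len(entry)) + entry
    # Extension header: type 0 (server_name), then 2-byte length of the body
    return struct.pack("!HH", 0, len(server_name_list)) + server_name_list


def read_sni_extension(data: bytes) -> str:
    """Recover the hostname the way an on-path observer could; no key needed."""
    ext_type, ext_len = struct.unpack_from("!HH", data, 0)
    if ext_type != 0:
        raise ValueError("not a server_name extension")
    (list_len,) = struct.unpack_from("!H", data, 4)
    if ext_len != list_len + 2:
        raise ValueError("inconsistent length fields")
    name_type, name_len = struct.unpack_from("!BH", data, 6)
    if name_type != 0:
        raise ValueError("not a host_name entry")
    return data[9:9 + name_len].decode("ascii")


wire = build_sni_extension("yahoo.com")
print(wire)                      # the hostname is visible as plain ASCII bytes
print(read_sni_extension(wire))
```

Nothing here involves a key or a secret: the observer only needs to know the field layout.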
Problems
CT Logs
TLS is underpinned by a system of trust, particularly in the entities called Certificate Authorities that cryptographically sign certificates used to establish encrypted communications. However, certificate authorities are fallible, and some have failed due to security breaches or by failing to abide by the rules and mis-issuing certificates. Some of the most egregious offenses from failed certificate authorities like DigiNotar, Symantec, and WoSign/StartCom have resulted in technology solutions that make it possible to hold them accountable. Certificate Transparency (CT) logs are a public ledger of certificates issued by authorities that allow their behavior to be monitored, but also create central clearinghouses of certificates that can be looked up by name or serial number. More on that soon.
OCSP
When a certificate is compromised, a certificate authority can revoke it. While normally a certificate has a limited duration noted by an immutable expiration date embedded into it, certificates may be prematurely revoked if the holder or the authority is compromised. The Online Certificate Status Protocol (OCSP) is a protocol that clients like web browsers use to verify that a certificate they receive is still valid. OCSP lets a client ask "I just received this certificate for Yahoo.com, but is it valid?" The request is obscure, but not secure:
The request has a one-way hash of the distinguished name and public key in the certificate as well as the serial number of the certificate. Unsalted hashes mean anyone could poll CT logs for all distinguished names, build their own hash lookup dictionary, and then compare this value to their dictionary. However, the unhashed serial number makes this far easier, as many CT logs support direct lookup of certificates by their serial number. In the following screenshot, you can see a trivial lookup to find out my lab virtual machine was connecting out to support.mozilla.org.
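The dictionary approach described above can be sketched in a few lines of Python. This is an illustration only: real OCSP requests hash DER-encoded structures rather than plain strings, and the harvested names below are hypothetical stand-ins for what an observer might collect from CT logs.

```python
import hashlib


def sha1_hex(value: str) -> str:
    """Unsalted hash: the same input always yields the same digest."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


def build_lookup(known_names):
    """Precompute hash -> name for every name the observer has harvested."""
    return {sha1_hex(name): name for name in known_names}


# Hypothetical names harvested from CT logs (assumption for this sketch)
harvested = ["support.mozilla.org", "www.example.com", "mail.example.net"]
lookup = build_lookup(harvested)

# Stands in for a hash value seen on the wire in an unencrypted request
observed_hash = sha1_hex("support.mozilla.org")
print(lookup.get(observed_hash, "unknown"))
```

Because the hash has no salt, the table only has to be built once and can be reused against every request the observer sees.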
Summary
This is not a new vulnerability. In fact, RFC 6960, which defines OCSP, explicitly states:
"Where privacy is a requirement, OCSP transactions exchanged using HTTP MAY be protected using either Transport Layer Security/Secure Socket Layer (TLS/SSL) or some other lower-layer protocol."
Incorrectly, some presume OCSP must be performed over insecure HTTP to address a 'chicken and egg' problem that would arise from trying to validate the certificate of a secure OCSP site to validate the certificate of another secure site. While implementation details could be non-trivial, solutions like pinning the TLS certificates of well-known OCSP responders could address that challenge.
It is important, though, to consider that in the cat-and-mouse game of threats to privacy and privacy-protecting technologies, OCSP is a more readily available source of metadata on users as HTTPS adoption increases, CT logs become mandatory and pervasive, and insecure OCSP communications dominate the responder implementations. As other privacy holes are addressed, such as DoH and ESNI, to keep users' Internet activity private, OCSP is a challenge at scale to address as well.
PowerShell one-liner to find outbound connectivity via WinRM
In controlled environments, it’s useful to know when outbound connectivity is not restricted to a predefined list of required hosts, as many standards like PCI require. Here’s a helpful one-liner that will query your Active Directory instance for computer accounts that are enabled, and then for each of them try to connect to a site from that machine, as orchestrated by WinRM. If you use this script, just know that you will probably see a sea of errors for machines that cannot be reached from your source host via WinRM. My go-to site for testing non-secure HTTP is asdf.com, but you could use any target and port you desire based on what should not be allowed in your environment. I have changed the snippet below to example.com (which will not work) so I don’t spam the poor soul who runs asdf.com, but you should replace that with google.com or whatever host to which you wish to verify connectivity.
# For every enabled AD computer, test outbound TCP/80 from that machine via WinRM
Invoke-Command -ComputerName (Get-ADComputer -Filter 'Enabled -eq $true' |
    ForEach-Object { $_.Name }) -ScriptBlock {
    Test-NetConnection -Port 80 "example.com" | Select-Object TcpTestSucceeded }
The output will look something like this:
TcpTestSucceeded PSComputerName RunspaceId
---------------- -------------- ----------
True YOUR-HOST-1 d5fd044c-c268-460e-a274-d3253adc8ce2
True YOUR-HOST-2 98206f71-80c1-4e7e-a467-fec489c542ee
False YOUR-HOST-3 d0b6cf57-e833-44a6-a7bb-aebd4d854b5c
True YOUR-HOST-4 14af618b-1ca7-4c1f-bb56-ce58dbd4af94
It’s a great sanity check before an audit or after major changes to your network architecture or security controls. Enjoy!
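For a host where PowerShell or WinRM is not available, the same kind of check can be approximated locally. The Python sketch below is not part of the one-liner above; it simply attempts a TCP connection and reports success or failure, much like the TcpTestSucceeded column. The function name and the timeout default are assumptions for illustration.

```python
import socket


def tcp_test_succeeded(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False


if __name__ == "__main__":
    # Replace example.com with whatever should NOT be reachable in your environment
    print(tcp_test_succeeded("example.com", 80))
```

A result of True from a machine that is supposed to be locked down is the finding to chase.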
SQL Injection with New Relic [PATCHED]
Background
First off, I have found New Relic to be a great application performance monitoring (APM) tool. Its ability to link transaction performance from the front-end all the way to back-end database queries that slow your web application is pretty awesome. This feature lets you see specific queries that are running slowly, including the query execution plans and how much time is spent on processing various parts of a database request. From their online documentation, the interface looks similar to this:
What’s not so awesome is when your APM’s method for retrieving this data creates a SQL injection flaw in your application that wasn’t there before. In October 2016, I became aware of some strange errors when a DBA was trying to load SQL Server trace files into PSSDiag, due to a formatting problem in the trace file itself. Our DBA discovered that unclosed quotation marks were causing problems with PSSDiag loading trace files. So, how could an unclosed quotation mark even be happening? It’s a hallmark of a SQL injection exploit, and so I began digging.
It appeared our ORM (NHibernate at the time) was sending unparameterized queries, and one of the field values had an unescaped quotation mark, which was causing the error in PSSDiag. However, in other cases the same query, unique to an area of our code, would be issued with parameters. Upon further digging, it actually appeared our application was submitting the same query twice, first with the parameterized query version, and a second with parameter values replaced into the query string, sandwiched with a SET SHOWPLAN_ALL. It looked a bit like this:
exec sp_executesql N'INSERT INTO dbo.Table (A, B, C)
VALUES (@p0, @p1, @p2);select SCOPE_IDENTITY()'
,N'@p0 uniqueidentifier,@p1 uniqueidentifier, @p2 nvarchar(50)'
,@p0='{Snipped}',@p1='{Snipped}',@p2=N'I don''t even'
Followed by:
SET SHOWPLAN_ALL ON
INSERT INTO dbo.Table (A, B, C)
VALUES ('{Snipped}', '{Snipped}', 'I don't even');select SCOPE_IDENTITY()
As you can see in the first example created by NHibernate, the word “don’t” was properly escaped; however, in the subsequent execution, it was not. This second statement is sent by our very same application process, which New Relic will instrument using the ICorProfilerCallback2 profiler hook to retrieve application performance statistics. But it doesn’t just snoop on the process, it actually hijacks database connections to periodically piggyback on their ‘echo’ of requests to retrieve metrics used to populate their slow queries feature. The SET SHOWPLAN_ALL directive causes the subsequent request not actually to return data, but to just return the execution plan.
(DBAs will note this is actually not a reliable way to retrieve this data at all, as parameterized queries can and often do have very different query execution plans when parameter sniffing and lopsided column statistics are in play. But that’s how New Relic does it.)
This is pretty bad, because now virtually every user-provided input that is sent to your database, even if programmed using secure programming practices to avoid SQL injection flaws, becomes vulnerable when New Relic is installed with the Slow Queries feature enabled. That being said, New Relic does not send this second ‘show plan’ and repeated statement set for every query. It samples, appending it only onto some executions of any given statement. An attacker attempting to exploit this would not be able to do so consistently, although repeated attempts on something like the username field of a login screen, which in many systems is likely logged to a database table that stores usernames of failed login attempts, would occasionally succeed when the subsequent SHOWPLAN_ALL and unparameterized version of the original query is injected at the end of the request by New Relic.
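The difference between the two statements can be reproduced in miniature. The Python sketch below uses SQLite as a stand-in for SQL Server and a hypothetical helper, inline_values_naively, to mimic the pattern of substituting parameter values back into the query text without escaping. It is an illustration of the failure mode, not New Relic's actual code.

```python
import sqlite3


def inline_values_naively(sql: str, values) -> str:
    """Substitute values into the SQL text without escaping (the unsafe pattern)."""
    for value in values:
        # Replaces the next '?' placeholder; quotes inside value are left as-is
        sql = sql.replace("?", "'" + value + "'", 1)
    return sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

insert_sql = "INSERT INTO notes (body) VALUES (?)"
value = "I don't even"

# Parameterized: the driver keeps the value separate from the SQL text
conn.execute(insert_sql, (value,))

# Naive replay: the apostrophe ends the string literal early
replayed = inline_values_naively(insert_sql, [value])
print(replayed)
try:
    conn.execute(replayed)
    replay_failed = False
except sqlite3.OperationalError as exc:
    replay_failed = True
    print("replay failed:", exc)
```

Here the stray apostrophe merely produces a syntax error, but an input crafted to close the literal cleanly would instead run as SQL, which is what makes the replayed statement exploitable.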
Timeline
- October 5, 2016: Notified New Relic
- October 5: New Relic acknowledges issue and provides a workaround (disabling explain plans)
- October 6: New Relic's application security team responds with details explaining why they believe the issue is not exploitable as a security vulnerability. Their reasoning is based on the expected behavior of SHOWPLAN_ALL, which would not execute subsequent commands
- October 6: I provide a specific example of how to bypass the 'protection' of the preceding SHOWPLAN_ALL statement that confirms this is an exploitable vulnerability.
- October 6: New Relic confirms the exploit and indicates it is targeted for resolution in their upcoming 6.x version of the New Relic .NET Agent. I confirm the issue in New Relic .NET Agent 5.22.6.
- October 7: New Relic indicates they will not issue a CVE for this issue.
- October 12: New Relic updates us that a fix is still in development, but a new member of their application security team questions the exploitability of the issue.
- October 12: I provide an updated, detailed exploit to the New Relic security team to demonstrate how to exploit the flaw.
- November 8: Follow-up call with New Relic security team and .NET product manager on progress. They confirm they have resolved the issue as of the New Relic .NET Agent 6.3.123.0.
- November 9: Issue addressed in the fixed .NET Agent.
- May 26, 2017: Public disclosure
Conclusion
First off, I want to applaud New Relic on their speedy response and continued dialogue as we worked through the communication of this issue so they understood how to remediate it. On our November 8 call, I specifically asked if New Relic would reconsider their stance of not issuing a CVE for the issue, or at least clearly identify 6.3.123.0 as a security update so developers and companies that use this agent would know they needed to prioritize this update. They thoughtfully declined, and I did inform them that I would then be publicly disclosing the vulnerability if they did not.
Even if I don’t agree with it, I understand the position companies take about not proactively issuing CVEs. However, I do believe software creators must clearly indicate when action is needed by their users to update software they provide to resolve security vulnerabilities. Many IT administrators take the ‘if it’s not broken, don’t update it’ approach to components like the New Relic .NET Agent, and if no security urgency is communicated for an update, it could take months to years for it to be updated in some environments. While some companies may be worried about competitors' narratives or market reactions to self-disclosing, the truth is vulnerabilities will eventually be disclosed anyway, and providing an appropriate amount of disclosure and timely communications for security fixes is a sign of a mature vulnerability management program within a software company.
Also, be sure if you put any mitigation techniques in place that they actually work. We stumbled upon another bug in working around the issue that was subsequently fixed in 6.11.613 where trying to turn off the ‘slow query’ analysis feature per the New Relic documentation did not consistently work.
Given the potential gravity of this issue, I have quietly sat on this for almost 7 months to allow for old versions of this agent to be upgraded by New Relic customers, in the name of responsible disclosure. I have not done any testing on versions of New Relic agents other than the .NET one, but I would implore security researchers to test agents from any APM vendor that collects execution plans as part of their solution for this or similar weaknesses.
Last weekend, I did some sprucing up of my public website. It’s just a simple static one-pager, but why on earth keep a Windows box just to host that? It was long overdue for me to move something simple into something more cost effective that I could securely manage more easily. In case others are looking for a quick recipe book on the same, here’s what I did last weekend:
Spin up an Encrypted Linux AMI in AWS
My objectives in this move were to (1) keep it simple, and (2) keep it secure. I've already enjoyed great success using NGINX at Alkami in getting the best security posture possible for TLS termination, and NGINX can be more than just a reverse proxy - it can also work as a blazingly fast web server for static content too. Dusting off my Apache skills just for this project seemed unnecessary, so for this recipe, we're going to be setting up NGINX as the only server process for this static site. If you aren't familiar with NGINX... don't fret - I'm going to make it easy to configure and explain each step along the way, although you can reference great AWS documentation here too.
Encrypt your AMI
If you're a Linux guy, you probably have a distribution already in mind. For this project, I'm fine with the standard machine image Amazon AWS puts together, and I don't necessarily need to worry about which package manager I should use or what filesystem or startup configuration file layout I prefer to maintain. Going with a plain vanilla Amazon Linux AMI (AMI ID ami-178ef900), I:
- Created a new AWS account. This is very easy to do with a credit card, although I'll be using the Free Usage tier of services for this recipe and don't plan to go over those thresholds.
- Went to the EC2 console - that's the Elastic Compute Cloud - and clicked the AMIs option under Images on the left.
- Searched for AMI ID ami-178ef900
- Right-clicked to select the result and chose Copy AMI, selected the Encryption option, and confirmed Copy AMI.
- Here you have an option of getting fancy with key management and creating a special key for this encrypted operating system image. We don't need to be fancy, we just need to be secure. If we are using this AWS account just for a public website and for that single purpose, the default key for the account is just fine.
Setup a Secure Security Group
Under Security Groups under Network & Security in the EC2 console, we're going to define who can access our new AMI. To start, we will only allow access to ourselves while we configure and harden it. Only after we're happy with the configuration will we open it up to the world. To do this:
- In Security Groups, click Create Security Group at the top.
- Name your security group something simple, like webserversecuritygroup
- Add three Inbound rules
  - HTTP from My IP only - this is how we will test insecure HTTP connections
  - HTTPS from My IP only - this is how we will test secure HTTP connections
  - SSH from My IP only - this is how you will connect to your new AMI with PuTTY or another terminal session manager
- By default your instance can connect Outbound anywhere. Not a great idea for a production enterprise system. For this recipe, we're going to leave this with this default, but we could shore it up later once we get everything like OCSP working near the end. Flipping on a lot of security early on can make this whole process much more painful, so our approach will be (1) to use secure defaults, (2) get functionality working, then (3) harden it.
Launch Your AMI
When the copy completed in about 5 minutes, I was able to right-click the encrypted AMI I had just copied from the source and click "Launch". Here I was able to select the options for the virtual machine I would boot with this AMI as the image, and in order:
- I used the t2.micro instance to keep it simple and free.
- Chose the webserversecuritygroup I created in the previous step
- Selected the VPC and subnet (if you aren't sure about VPCs and subnets, your AWS account comes with a default one you can set up in the VPC option where you previously selected EC2. The first option, just one public subnet, is fine for this application, because we won't have back-end database or file servers that in a more complex environment we would architect for additional layers of security. It's completely unnecessary for this recipe, and quite honestly, I prefer to keep my various recipes in separate AWS accounts to keep cost tracking easier. Don't complicate this for yourself - one public subnet is all you will need and use.)
- You will need to choose what SSH key you will use to connect to the instance in our post-modern, password-less world. If you haven't set up an SSH key yet, you will create one here and just download the .PEM file
- Enabled protection against accidental termination.
- Clicked Review and Launch, then waited about 10 minutes for the machine to spin up.
- You will need to Download
Grab a Constant IP Address
While I was waiting for that, I wanted an Elastic IP. AWS will generate a public IP for your EC2 instance, but that public IP isn't guaranteed to stay with you. We want that guarantee, and at about $3/mo for an Elastic IP address, it's worthwhile not to have to muck with DNS updates any time I may reboot or rebuild a box and potentially suffer through downtime. To get and use an Elastic IP:
- Go to Elastic IPs under Network & Security in the EC2 console.
- Click Allocate New Address
- (Once that EC2 instance we're firing up is up and running, then you can) Right click your new address and choose Associate Address. We kept it simple and only have a single EC2 instance in this AWS account, so it's easy to select the only instance to associate this address to it.
- Thanks to the magic of the software-defined networking stack of AWS, you don't need to mess with ifconfig or reboot your AMI once you make this change - you're ready to go.
Prepare the Box
In an enterprise production system, we'd probably already have pristine golden images, fully patched, tailored for our need. Here, we don't; we just went with a reasonable default. But in either case, and especially in this one, we need to make sure we have all the latest patches, so we'll:
- Connect to the box using PuTTY using the key-based login of the .PEM file we generated or chose when we launched our AMI.
- Enter ec2-user as our username when prompted
- Enter the password for our key in the .PEM file when prompted, and we're in.
- And if you're not in, either you don't SSH much, or you've forgotten how to use PuTTY. Documentation is your friend.
- Type sudo su to get root access
- Apply updates for your packages, type yum update
- Remember, we're using NGINX, and it's not installed on the default Linux AMI, so we'll simply do yum install nginx to make that happen
- There are other nice things we'll use in the epel-release that make using advanced NGINX features easier, so let's also do yum install epel-release
Configure NGINX
NGINX is pretty simple to configure when you know which options need to be configured. Just like the Linux AMI, it comes out of the box with relatively sane defaults, and we'll use those as a starting point. There are two locations of special significance: /etc/nginx/nginx.conf, which is the overall configuration for NGINX, and /etc/nginx/conf.d/, from which it can 'include' other files. We'll use this separation to make minimal changes to NGINX's overall configuration and keep our site configuration centralized in one site-specific configuration file, which makes it easy to add another site to the same box in the future.
Make sure of a few simple things first in your global /etc/nginx/nginx.conf file. I'm going to reproduce mine with my comments alongside.
# For more information on configuration, see:
# * Official English Documentation: http://nginx.org/en/docs/
# * Official Russian Documentation: http://nginx.org/ru/docs/
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;
events {
    worker_connections 1024;
}
http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    proxy_cache_path /tmp/nginx levels=1:2 keys_zone=default_zone:10m inactive=60m;
    proxy_cache_key "$scheme$request_method$host$request_uri";
    # Load modular configuration files from the /etc/nginx/conf.d directory.
    # See http://nginx.org/en/docs/ngx_core_module.html#include
    # for more information.
    # This is the line that brings in our site-specific configuration files
    include /etc/nginx/conf.d/*.conf;
    # My site's default is just index.html, so I've simplified this line to make that the only one that can be served by default
    index index.html;
    server {
        # Listen on insecure HTTP IPv4 port 80 in this server block
        listen 80 default_server;
        # Also, listen on insecure HTTP IPv6 port 80 in this server block
        listen [::]:80 default_server;
        # This will serve as a catch-all, regardless of domain name specified.
        server_name localhost;
        root /usr/share/nginx/html;
        # Load configuration files for the default server block.
        include /etc/nginx/default.d/*.conf;
        location / {
            limit_except GET {
                # This block ensures any HTTP verb that is not GET just gets denied. Remember, we have a simple static site.
                deny all;
            }
            # This whole 'server' block is for PORT 80 insecure traffic only. We want users to get redirected to HTTPS always.
            return 301 https://seanmcelroy.com;
        }
        # We'll cover this CSP line later.
        add_header Content-Security-Policy "default-src 'none'; script-src 'self'; img-src 'self'; style-src 'self'";
        # Instruct the browser to never again ask for this site except using HTTPS
        add_header Strict-Transport-Security "max-age=31536000" always;
        # Tell the browser not to try to second-guess our Content-Type HTTP response headers; old browser security problem
        add_header X-Content-Type-Options nosniff;
        # We don't use IFRAMEs, so if someone tries to frame this site in a user's browser, the browser should just error out
        add_header X-Frame-Options DENY;
    }
}
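Before moving on, it's worth checking that an edited nginx.conf still parses. Here is a minimal sketch, assuming the nginx binary is on the PATH and that the box manages NGINX through the service command. The function name reload_nginx_if_valid is my own.

```shell
# Test the NGINX configuration and reload only if it parses cleanly.
reload_nginx_if_valid() {
    # nginx -t checks syntax and referenced files without touching the running server.
    if nginx -t; then
        service nginx reload
    else
        echo "nginx configuration test failed; not reloading" >&2
        return 1
    fi
}
```

Run it as root after each configuration change. If NGINX has not been started yet, a start rather than a reload would be needed the first time.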
As you'll note from your own initial NGINX configuration file, I removed a lot of commented-out lines and added a few things, which I documented above with my reasons. The meat of our site, though, will be our next file, which enumerates the settings for our target domain, in this case seanmcelroy.com. Before we go through that, let's establish where some of the paths we'll reference in that configuration file are going to be. We're going to put our website in /var/www/seanmcelroy.com/public. We'll place any secure keys like our TLS certificate in /var/www/seanmcelroy.com/key. I'm going to want to find logs for this site in a predictable place where they aren't co-mingled with other sites I might add in the future, so I'll also need a /var/www/seanmcelroy.com/log directory. So, let's do this at the command line:
- Create the www directory: mkdir /var/www
- Create the site-specific directory: mkdir /var/www/seanmcelroy.com
- Set permissions on these directories by issuing
- chmod 755 /var/www
- chmod 755 /var/www/seanmcelroy.com
- Set ownership on these directories to the root user and the root group by issuing
- chown root:root /var/www
- chown root:root /var/www/seanmcelroy.com
- We'd like to place our site in the public directory, but we don't want to have to act as root each time we want to edit its files... so we'll create that one slightly differently
- mkdir /var/www/seanmcelroy.com/public
- chown root:ec2-user /var/www/seanmcelroy.com/public
- chmod 775 /var/www/seanmcelroy.com/public
- This way, anyone in the ec2-user group can also edit the files herein
- NGINX will need to read the keys for this site, so we'll need to do some special permission settings on the key subdirectory
- mkdir /var/www/seanmcelroy.com/key
- chown nginx:nginx /var/www/seanmcelroy.com/key
- chmod 550 /var/www/seanmcelroy.com/key
- Now, only NGINX can read the keys herein
- NGINX will need to write log files out, so we'll do something mostly similar for the log directory
- mkdir /var/www/seanmcelroy.com/log
- chown nginx:nginx /var/www/seanmcelroy.com/log
- chmod 750 /var/www/seanmcelroy.com/log
- We need a certificate from our website issued by a certificate authority in a standard .PEM format. You have a few options here:
- You can get a free DV cert from StartSSL. This is the dirt-cheap solution, but given StartSSL was recently purchased by WoSign in a clandestine acquisition, and WoSign has had multiple and serious security lapses, you should not be trusting or supporting this entity.
- You can get a free DV cert from Let's Encrypt, if you're savvy enough to set up the automated renewal these 90-day certificates require. If you're this savvy, though, you probably aren't reading this blog, because you likely know much of the NGINX configuration I'm about to describe. In addition, you would need to automate the rest of the configuration to handle frequent certificate rotations and the updating of subsequent key files and DNS entries for some of the fancier things we will do, like DANE, near the end.
- You could buy a relatively cheap DV certificate from an authority like GoDaddy
- You could pony up for a mid-tier OV certificate from an authority like Entrust
- If, and only if, you have a registered business with a DUNS number, you can get the top-tier assurance EV certificate from an authority like Entrust
- ... and let's face it, HTTP is so 1994. Soon Chrome will warn users who visit HTTP sites that your site is insecure by default, and that's not what you want to project, so you will pick one of the 5 options above.
- Your certificate is likely issued from a certificate authority's trusted root certificate, which in turn has trusted an issuing certificate, which in turn has issued the certificate you acquire. Or, sometimes instead of three links to this chain (root, intermediate, leaf), there are four (root, intermediate1, intermediate2, leaf). This is important because you will need a few files here:
- Your leaf's private key, what I will call seanmcelroy.com.private.rsa below
- Your leaf's public key + your intermediate(s) public keys, what I will call seanmcelroy.com.chained.crt below.
- Your leaf's public key + your intermediate(s) public keys + your root certificate authority's public key, what I will call seanmcelroy.com.chained+root.crt below.
- Some very big and unique prime numbers used for Diffie-Hellman key exchange, what I will call seanmcelroy.com.dhparam.pem below
- To accomplish this, you will use openssl. You will not use online converters that have you upload your private keying material and do the work for you. You will not use online converters. You will never, ever upload your private key anywhere unencrypted, and when you do upload it encrypted, you would never supply or keep the passphrase for it in any connected container. Here are some openssl cheatsheet commands for you, presuming you obtained a .PFX file that contains your public and private key combined.
- Export the private key for your leaf certificate into a seanmcelroy.com.private.rsa file from a seanmcelroy.com.pfx file. These literally are the keys to your kingdom - the private key without encryption or passphrase protection.
openssl pkcs12 -in seanmcelroy.com.pfx -out seanmcelroy.com.private.rsa -nocerts -nodes
- Export the public key for your leaf certificate into a seanmcelroy.com.public.crt file from a seanmcelroy.com.pfx file.
openssl pkcs12 -in seanmcelroy.com.pfx -out seanmcelroy.com.public.crt -clcerts -nokeys
- Export your issuer's root public certificate into a file
openssl pkcs12 -in seanmcelroy.com.pfx -out seanmcelroy.com.root.crt -cacerts -nokeys
- Create your DH primes for key exchange. You don't have to understand what this is in-depth, but you should understand it could take 10-15 minutes to complete.
openssl dhparam -out seanmcelroy.com.dhparam.pem 4096
- Now, let's create that chained file. OpenSSL strangely doesn't export a chain in the proper order. NGINX expects the leaf certificate first, followed by the intermediate(s). You can either manually save the intermediate certificate in a .PEM format (called entrust.com.L1K-chain.crt in the example below) and do:
OPTION 1) cat seanmcelroy.com.public.crt entrust.com.L1K-chain.crt > seanmcelroy.com.chained.crt
- OR, you could alternatively type this and hand edit the resulting file so the exported certificates appear in that order
OPTION 2) openssl pkcs12 -in seanmcelroy.com.pfx -nodes -nokeys -passin pass:<password> -out seanmcelroy.com.chained.crt
- And finally, we need to get our chained+root file, so we can do:
cat seanmcelroy.com.chained.crt seanmcelroy.com.root.crt > seanmcelroy.com.chained+root.crt
- (Finding the right OpenSSL commands can be time consuming if you don't know them already. In this example, I'm presuming you may have generated your CSR using IIS and completed it there or on another Windows-based system to get the resulting PFX we worked from, but if you read the OpenSSL documentation, it can handle many different input formats that don't require Windows or a PFX artifact at all.)
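Before copying these files into the key directory, it's worth confirming that the exported private key and leaf certificate actually belong together. Here is a minimal sketch, assuming an RSA key (as in the commands above) and the file names used in this article. The function name key_matches_cert is illustrative.

```shell
# Succeeds only when the certificate and private key share the same RSA modulus.
key_matches_cert() {
    cert_file="$1"
    key_file="$2"
    # Both commands print a single "Modulus=<hex>" line.
    cert_modulus="$(openssl x509 -noout -modulus -in "$cert_file")" || return 1
    key_modulus="$(openssl rsa -noout -modulus -in "$key_file")" || return 1
    [ -n "$cert_modulus" ] && [ "$cert_modulus" = "$key_modulus" ]
}
```

For example: key_matches_cert seanmcelroy.com.public.crt seanmcelroy.com.private.rsa && echo "key and certificate match". Use the leaf-only file rather than the chained one, since openssl x509 reads only the first certificate it finds.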
- This setup will only enable TLS 1.2 and 256-bit ECDHE and DHE RSA, leaving in the dust IE 10, Android 4.3 and earlier, and about every Java client out there as of this writing. I'm choosing security over accessibility so I get the principle of Forward Secrecy, and that sweet, sweet 100% Protocol Support rating in Qualys. If this was a production legacy site, you'd want to really think about these options, because a granny on a Tracfone stuck on Android 4.2 could be frustrated by your choices here, frustrating your call center as well.
- We don't want to deal with CRIME-mitigation, so gzip is going to be disabled. A complex production site may want to weigh this or implement gzipped cache assets differently, but our use case will keep it simple.
- We will use custom Diffie-Hellman (DH) prime numbers. Default implementations often use "well-known" primes that weaken your security and amplify the impact of vulnerabilities like LOGJAM and FREAK.
- We will enable OCSP stapling to improve page load times. This means NGINX will reach out to get OCSP responses from your root CA occasionally, so you can't turn off your Outbound connectivity in your EC2 security group without ensuring DNS and the ports used for this lookup remain open.
- We are going to PIN our TLS certificate public key using HTTP Public Key Pinning (HPKP)
- This means the server will tell the browser, "You should expect to always see THIS certificate in a certificate chain coming from this site for at least THIS amount of time"
- It also means we need to get a 2nd certificate as a backup, which is not part of the certificate chain of the first certificate.
- Which means double your money to buy a second certificate... hopefully with a different expiry period from the first
- Or, you get a dirt-cheap DV certificate as your emergency backup, and you use an EV or OV certificate as your primary one.
- To generate these hashes, you can check out Scott Helme's HPKP toolset - super useful! Or, Qualys' SSL Server Test can tell you at least the hash of the currently-presented certificate.
- We are going to instruct the browser that from now on, NEVER ask for this page over HTTP (or let Javascript make such a request) - HTTPS only from here on out. This is the Strict-Transport-Security header, otherwise known as HSTS.
- We are also going to have a tight policy on what our website should do using the Content-Security-Policy header, also known as CSP. Beware this header -- it takes time to test your policies, proportionate to the complexity and number of pages on your site. If you are a web developer, you can open up Chrome DevTools or Firebug to view problems with a starting policy of "default-src 'none'" and handle each type of error one by one to get a custom, strict policy. Various groups debate the usefulness of CSP, and Google recently cast doubt on its efficacy. I wanted the bells and whistles, so it was worth 15 minutes for me to get my 1 page website working with it... but if you notice browser rendering problems, you will want to strike the relevant add_header line completely.
- We are going to instruct browsers not to guess on the MIME content types of our resources, but rather to just trust our Content-Type HTTP response headers. Some older browsers had security issues in their code that tried to read files to determine this. Modern browsers don't have this issue (and older browsers won't be able to speak the TLS 1.2 baseline requirement in this configuration anyway), but we simply want to deter the practice.
- Our site should never be in an IFRAME, so to protect from clickjacking, we instruct the browser to enforce this expectation.
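The pin values used for HPKP are base64-encoded SHA-256 hashes of a certificate's public key. If you'd rather compute one locally than rely on an online tool, a commonly used OpenSSL pipeline is sketched below. It assumes a PEM-encoded certificate file, and the function name hpkp_pin is illustrative.

```shell
# Print the base64-encoded SHA-256 hash of a certificate's public key (the HPKP pin format).
hpkp_pin() {
    # Extract the public key, convert it to DER, hash it, then base64-encode the digest.
    openssl x509 -in "$1" -pubkey -noout |
        openssl pkey -pubin -outform der |
        openssl dgst -sha256 -binary |
        openssl enc -base64
}
```

For example, hpkp_pin seanmcelroy.com.public.crt prints the value that goes inside a pin-sha256="..." entry. Run it once for the primary certificate and once for the backup.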
#
# seanmcelroy.com
#
server {
    # For insecure HTTP port 80...
    listen 80;
    # And for either domain name, with and without the 'www'...
    server_name seanmcelroy.com www.seanmcelroy.com;
    # Discourage deep links by using a permanent redirect to home page of HTTPS site
    # Redirect to the HTTPS version
    return 301 https://$host;
}
server {
    # But for secure HTTP port 443...
    listen 443 ssl;
    # And for either domain name, with and without the 'www'...
    server_name seanmcelroy.com www.seanmcelroy.com;
    # Server headers
    # Don't show the end-user the version of NGINX we run. Security through obscurity...
    server_tokens off;
    # We serve up the intermediaries and our leaf public key; mobile devices need this.
    ssl_certificate /var/www/seanmcelroy.com/key/seanmcelroy.com.chained.crt;
    # Our private site key used for the transport encryption
    ssl_certificate_key /var/www/seanmcelroy.com/key/seanmcelroy.com.private.rsa;
    # We are only going to enable TLS 1.2
    ssl_protocols TLSv1.2;
    # First prefer Elliptic Curve Diffie-Hellman AES-256 or better, and finally, regular DH AES-256 or better... or bust!
    ssl_ciphers 'AES256+EECDH:AES256+EDH:!aNULL';
    # If the client prefers different ciphers... too bad! We make the rules of the cipher negotiation.
    ssl_prefer_server_ciphers on;
    # DH primes
    # Use our custom DH parameters
    ssl_dhparam /var/www/seanmcelroy.com/key/seanmcelroy.com.dhparam.pem;
    # For OCSP stapling
    # Enable OCSP stapling
    ssl_stapling on;
    # Make sure the stapling responses match our chained+root file
    ssl_stapling_verify on;
    # ... THIS chained+root file
    ssl_trusted_certificate /var/www/seanmcelroy.com/key/seanmcelroy.com.chained+root.crt;
    # Use these nameservers to resolve OCSP servers for the stapling
    resolver 8.8.4.4 8.8.8.8;
    # For Session Resumption (caching)
    # Share a 10 MB TLS session cache across worker processes to improve page-to-page navigation speed
    ssl_session_cache shared:SSL:10m;
    # Allow TLS resumption for up to 10 minutes
    ssl_session_timeout 10m;
    # HPKP - public key pinning
    # These are the two hashes of two leaf certificates I use for public key pinning - start with a low max-age, then ratchet it up when tested out
    add_header Public-Key-Pins 'pin-sha256="qo5XNG/l96xuzO9F+syXML4wY3XAOM3J4r8mquhuwEs="; pin-sha256="RwJopnm+J6FZTS2jQBnGltzagjpTt62N8Oc4nGEW0Mo="; max-age=3600';
    location / {
        # Our website is served from this root directory
        root /var/www/seanmcelroy.com/public;
        # If no page is specified in a URL and index.html exists for a directory, serve that instead as the default document
        index index.html;
        # Store access log for this particular site into my custom log file
        access_log /var/www/seanmcelroy.com/log/seanmcelroy.com.access.log;
        # Let the browser know it could cache these pages for 30 days; tune if you manually update your static site often... but I bet you won't.
        expires 30d;
        proxy_cache default_zone;
        # Don't compress so we avoid TLS issues like the CRIME attack
        gzip off;
        limit_except GET {
            # If the browser requests an HTTP verb other than HEAD or GET, deny them.
            deny all;
        }
    }
    add_header Content-Security-Policy "default-src 'none'; img-src 'self' data: https://www.google-analytics.com; style-src 'self' 'unsafe-inline' https://fonts.googleapis.com; script-src 'self' 'unsafe-inline' https://fonts.googleapis.com https://www.google-analytics.com; font-src 'self' https://fonts.gstatic.com";
    add_header Strict-Transport-Security "max-age=31536000" always;
    add_header X-Content-Type-Options nosniff;
    add_header X-Frame-Options DENY;
}
Once your site is up and running, don't forget to update your EC2 attached Security Group to make HTTP and HTTPS available from Anywhere. Go ahead and leave SSH as "My IP", or simply remove it when you are done and add it back when you need to, as your IP can shift between one connection to this server and the next.
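Once the site is reachable, the response headers set in this configuration can be spot-checked from any machine with curl. This is a minimal sketch: the function name check_security_headers is my own, and the header list simply mirrors the add_header lines above.

```shell
# Fetch only the response headers and report any expected security header that is absent.
check_security_headers() {
    url="$1"
    headers="$(curl -sI "$url")" || return 1
    missing=0
    for name in Strict-Transport-Security Content-Security-Policy \
                X-Content-Type-Options X-Frame-Options Public-Key-Pins; do
        # Header names are case-insensitive, so match them that way.
        if ! printf '%s\n' "$headers" | grep -qi "^$name:"; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_security_headers https://seanmcelroy.com prints nothing and returns success when every header is present. curl -I sends a HEAD request, which the limit_except GET block still allows.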
That’s all for this configuration installment. Next time, I’ll probably be covering the how-to’s of DNSSEC, DANE, and OpenPGP PKA records for DNS-based security assertions and key publishing, but at least by the end of this article, you should be able to configure a relatively secure NGINX static content HTTP server, with many of the security bells and whistles enabled.
First Impressions Matter
When it comes to researching vendors, first impressions matter so much. I tend to judge any potential vendor by its sales apparatus, not just because it is the first impression, but because that positioning and interaction will tell you so much more than any press release, executive ‘corporate culture’ communication, or other third-party source of information on financial or industry strength. Things I notice right off the bat that influence my decision to continue engagement or build trust:
Is the sales channel optimized?
Building great companies and great products is all about optimization at a later stage of an organization's maturation life cycle. Idea-driven founding staff are joined or replaced by data-driven staff as a company's offering is validated and it grows to benefit from economies of scale and to show profitability to patient investors and equity holders. The distance between my interest and the vendor's name recognition is a marketing issue, but the distance between my identification of a vendor and getting a meaningful response from their sales organization is a sales/company issue. If I'm clicking through a brochure-ware website to find the place to start engagement, filling out a general 'Contact Us' form, navigating a tedious phone tree, or heaven forbid, clicking a 'mailto:' link to type my interest, then I've already learned a lot about your company. I've learned one of the following statements is true:
- The number of client contacts you deal with through this channel is relatively small: you are new or slow to acquire customers through it
- Your company is too focused on the ideation and 'fun' phase of the business to optimize your sales channel - your company may not be mature enough for my needs
- Your company is too focused on serving existing customers (keeping the wheels on) to work on growing your business by optimizing sales channels - your company may not be ready for my needs
- Your company is mature but not thinking about data-driven results, which tells me your product probably isn't either.
What is the quality of the first contact?
Did the person who responded to my inquiry bother to look up the domain of my e-mail address to check out what my company does? Does that sales executive reference recent press releases we made? This is a high-quality contact, and this action shows me your sales executives aren't quote-monkeys or order-takers, they are relationship-builders. Or did I just get a form letter thanking me for my form entry and letting me know someone may get back to me about whatever my interest might be? If it is the latter, this tells me:
- Your company will require me to tell, and you probably won't ask. I'll need to know what I want and be prepared to demand it. Since from the start of the relationship there was little concern for finding a good fit, I will have extra heavy lifting to do.
- If you are asking what my interest is and you don't already know, then that probably means you haven't placed me in any segment or internal classification that represents the nature of my potential demand. That tells me the out-of-the-box customization of the solution may be low, or if not, you are not capitalizing on the specialized needs of different classes of customers.
- If I get an "I don't know" in the first conversation, that is okay, but it tells me I'm either working with someone who does not know their product well (new or inexperienced), or the sales group is not connected to the product group, which is a more fundamental problem. The most important communication line is (in my view) between sales and product, and secondly between sales and operations, to ensure, in order, that (1) pre-sales, the right solution is sold to a customer (if that doesn't happen, everything else will fail), and (2) post-sales, the requirements are appropriately communicated to deliver a synchronized expectation and final result.
What is the speed of the first quality contact?
- If I get a poor-quality first contact very fast, I presume I'm talking to someone young and hungry. This can be a good sign if I need a lot of attention or customization and you're not a large player. This is a very bad sign if you have a signature single product and are an established company, as I assume there's inadequate sales training or high sales churn, both of which send a negative signal about your company's position and our potential together.
- If I get a high-quality first contact very slowly, I'm not thrilled, but I'm willing to wait and pay for quality. Not everyone is, but that's how I do business.
- If I get a poor-quality contact very slowly, you really shouldn't be in business, and you probably won't be anymore very soon.
Alkami: Genesis
In the summer of 2008, I was preparing a large strategic product shift within Myriad Systems, Inc. to unify a suite of ancillary banking products I had built and managed: remote deposit capture, merchant capture, expedited payments, e-Statements, e-Notices, check imaging, and a one-to-one marketing solution, among many others. A key opportunity presented itself in that we had several large and progressive financial institution clients that were interested in what an MSI online banking offering could look like, particularly given the relatively poor user experience in the online banking offerings at the time. This would have completed a big piece of the end-user product portfolio for MSI, and as daunting as building online banking from the ground up is, it stood to provide substantial strategic value to our whole suite.
Computer Services, Inc. began courting MSI and started a full acquisition in August of 2009. It was clear CSI’s intent was to maximize the value of the print and mail operational assets of MSI, but it had little interest in its online banking products other than to preserve existing revenue streams. This disinterest in the strategic vision of the online web applications as a product portfolio was the impetus for me to pursue my personal career interests of building a best-in-breed online banking solution outside of the MSI umbrella.
Jeff Vetterick and Richard Owens, two industry colleagues who had previously had stints at MSI, reached out when they heard of my desire to move on and continue building online banking. They encouraged me to reach out to Gary Nelson, an acquaintance who was part of the very successful build and sale of Advanced Financial Services to Metavante (an interesting and great story in and of itself), and who had interest in this as well. After AFS, Gary had many interests and projects, a significant one being an idea to build a learning management system that gave schools tools to deliver educational content online, where students would have a fictitious bank account balance and, through different learning modules, come to understand concepts of spending, budgeting, and the time value of money.
When I spoke to Gary in September, I found this initiative was in wind-down: the project had exceeded its funding, and only an IT manager had been retained as a temporary contractor to document and turn over all the company’s assets. Gary engaged me as a consultant to analyze the source code developed by that team and determine whether it had any value as an asset for sale as the company was closed up. I reviewed the company’s source and patents, but when I started looking at the few cloud VMs and pulled open the Subversion repository where the source code was supposed to be, I found a shocking lack of value: what did exist were some architectural documents and some demoware in the form of static screens coded into a .NET MVC ‘shell project’ that had no actual implementation or integration of the key concepts around educational content delivery and assessment. Looking back at the Finovate presentation the team from this company did, I found only that minimal proof of concept presented on stage, and little more.
The internal company documentation in the form of ‘wikis’, agile storyboards, and some unorganized developer notes showed no cohesive technical direction or architectural plan. When I began reviewing invoices for consultants and local contractors, a sad picture materialized: I felt Gary and other investors had been somewhat duped by a mixture of technical ineptitude and probably some overbilling greed by people and local development ‘firms’. I delivered the news that what assets I could find and review had little fire-sale value, other than perhaps one patent that had some intrinsic value, but no implementation. I illustrated this situation by opening the source code for the portion of the system that purported to calculate a ‘relationship score’ reflecting how well an end-user understood financial literacy content, along with their behavior in their accounts, their transactions, and their progress in meeting their financial goals; the source code simply ran in an endless empty loop, doing nothing. Demoware.
After I delivered the news to Gary and began preparing for whatever my next endeavor would end up being, Gary suggested I reach out to Stephen Bohanon, a consultant with Catalyst Consulting Group who had previously been a high-performing salesperson with AFS. After several discussions, it became clear Gary had an appetite to try a pivot in the financial technology web application space, and both Stephen and I were interested in building a world-class online banking solution - he as a formidably talented sales executive to build relationships and grow the organization, and I to grow a technical team that would architect and build our next-generation online banking user experience.
And with no pre-existing source code, and only great ideas, tremendous perseverance, and some money (thanks, Gary!), we founded Alkami.
Security Advisory for Financial Institutions: POODLE
Yesterday evening, Google made public a new form of attack on encrypted connections between end-users and secure web servers using an old form of encryption technology called SSL 3.0. This attack could permit an attacker who has the ability to physically disrupt or intercept an end-user’s browser communications to execute a “downgrade attack” that could cause an end-user’s web browser to attempt to use the older SSL 3.0 encryption protocol rather than the newer TLS 1.0 or higher protocols. Once an attacker successfully executed a downgrade attack on an end-user, a “padding oracle” attack could then be attempted to steal user session information such as cookies or security tokens, which could be further used to gain illicit access to an active secure website session. This particular flaw is termed the POODLE (Padding Oracle On Downgraded Legacy Encryption) attack. At the time this advisory was authored, US-CERT had not yet published a vulnerability document, but has reserved advisory number CVE-2014-3566 for its publication, expected today.
It is important to know this is not an attack on the secure server environments that host online banking and other end-user services, but is a form of attack on end-users themselves who are using web browsers that support the older SSL 3.0 encryption protocol. For an attacker to target an end-user, they would need to be able to capture or reliably disrupt the end-user’s web browser connection in specific ways, which would generally limit this capability to end-user malware, attackers on the user’s local network, or attackers who control significant portions of the networking infrastructure an end-user is using. Unlike previous security scares in 2014 such as Heartbleed or Shellshock, this attack targets the technology and connection of end-users. It is one of many classes of attacks that target end-users, and is not the only such risk posed to end-users who have an active network attacker specifically targeting them from their local network.
The proper resolution for end-users will be to update their web browsers to versions, not yet released, that completely disable this older and susceptible SSL 3.0 technology. In the interim, service providers can disable SSL 3.0 support, with the caveat that IE 6 users will no longer be able to access those sites without making special settings adjustments in their browser configuration. (But honestly, if you are keeping IE 6 a viable option for your end-users, this is one of many security flaws those users are subject to.) Institutions that run on-premises software systems for their end-users may wish to perform their own analysis of the POODLE SSL 3.0 security advisory and evaluate what, if any, server-side mitigations are available to them as part of their respective network technology stacks.
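As one concrete illustration of evaluating a server-side mitigation, here is a minimal sketch that flags an NGINX configuration file still listing SSLv3 in an ssl_protocols directive. NGINX is assumed here purely as an example (this advisory does not name a specific server), and the function name allows_sslv3 is illustrative.

```shell
# Succeeds (exit status 0) when any active ssl_protocols directive in the file lists SSLv3.
allows_sslv3() {
    # Only consider lines that start with the directive, so commented-out lines are ignored.
    grep -E '^[[:space:]]*ssl_protocols[[:space:]]' "$1" | grep -qw 'SSLv3'
}
```

For example: allows_sslv3 /etc/nginx/nginx.conf && echo "SSLv3 is still enabled". A file with no ssl_protocols directive falls back to the defaults of the installed NGINX build, which this check does not cover.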
Defense-in-depth is the key to a comprehensive security strategy in today’s fast-developing threat environment. Because of the targeted nature of this type of attack, and its prerequisites for a privileged vantage point to interact with an end-user’s network connection, it does not appear to be a significant threat to online banking and other end-user services, and this information is therefore provided as a precaution and for informational purposes only.
All financial institutions should subscribe to US-CERT security advisories and monitor the publication of CVE-2014-3566, once released, for any further recommendations and best practices. Updated versions of Chrome, Firefox, Internet Explorer, and Safari that remove all support for the older SSL 3.0 protocol will be announced through their respective vendor release notification channels. Until US-CERT publishes more information, refer to the Google whitepaper directly at https://www.openssl.org/~bodo/ssl-poodle.pdf
Alkami: A Retrospective
What a wild and crazy journey the last five years have been.
When I started this blog in 2009, it was shortly after I had inked a deal with an angel investor and journeyed down the road with him and my other co-founder and established Alkami Technology. Against significant odds, this October marks the five year anniversary of a roller-coaster ride on up, which galvanized Alkami as the clear leader in the online banking space. Before jumping into this endeavor, I was no stranger to walking products from idealization to realization or running enterprise services in a SaaS model. But doing all that against the tremendous downside risks of the start-up world, as the new kid on the block among a world of established, very-well funded competitors has been challenging. Actually, it’s been brutal.
Reflecting on the past sixty months, I’ve started to pull together my notes from the early days, both before and after founding Alkami, and I will be commemorating this milestone with a series of blog posts on some company history - the why and how, as well as some valuable and hard-learned lessons along the way. No one, no company finds tremendous success spontaneously. While an Inc. 500 splash piece on a company might portray success like a serendipitous fairy tale, only through a voracious appetite for risk, an iron stomach for failure, and a committed and skilled team does any great company find its footing. It’s a great feeling to walk into the office every week and see new, fantastic talent we’ve added to our team and forward-leaning designs and concepts in our flagship solution. It’s also a very satisfying one to know your personal efforts and sacrifices made that team and that company possible.
This series of posts will not be a beating of the chest or self-congratulatory account of our accolades. Our work is far from over, and I judge success on a much longer time horizon. But it will be a real account of our origin story, entrepreneurship, missteps and course correction, and moving from start-up to scale-out in a slow sales cycle, highly-regulated industry. It’s one thing to have a hip product idea you incubate through an accelerator and debut on a demo day. It’s a very different thing to bootstrap a firm and an entire platform where you have to answer a few hundred RFP questions to get a prospect to even talk with you, many other steps to get just one sale, and many sales to get that kind of investor attention.
Those pieces are now in place and solidifying every day as we take an aggressive product and technical vision to its successful conclusion. I’m honored to have found great working partners, worked (and still mostly continue to work) with some of the most committed and skilled people across a variety of disciplines along the way. As we look back in retrospect at five formative years, I’m eager to chronicle our story and to add others who will extend and craft our bright future. Stay tuned.
Security Advisory for Financial Institutions: Shell Shock
“Shell Shock” Remote Code Execution and Compromise Vulnerability
Yesterday evening, DHS National Cyber Security Division/US-CERT published CVE-2014-6271 and CVE-2014-7169, outlining a serious vulnerability in a widely used command line interface (or shell) for the Linux operating system and many other *nix variants. This software bug in the Bash shell allows files to be written on remote devices or remote code to be executed on remote systems by unauthenticated, unauthorized malicious users. Because the vulnerability involves the Bash shell, some media outlets are referring to this vulnerability as Shell Shock.
Nature of Risk
By exploiting this parsing bug in the Bash shell, other software on a vulnerable system, including operating system components, can be compromised, including the OpenSSH server process and the Apache web server process. Because this attack vector allows an attacker to potentially compromise any element of a vulnerable system, effects from website defacement to password collection, malware distribution, and retrieval of protected system components such as private keys stored on servers are possible, and the US-CERT team has given this its highest-impact CVSS rating of 10.0.
Please be specifically aware that a patch was provided to close the issue for the original CVE-2014-6271; however, this patch did not sufficiently close the vulnerability. The current iteration of the vulnerability is CVE-2014-7169, and any patches applied to resolve the issue should specifically state they close the issue for CVE-2014-7169. Any devices that are vulnerable and exposed to any untrusted network, such as a vendor-accessible extranet or the public Internet, should be considered suspect and isolated and reviewed by a security team due to the ability for “worms”, or automated infect-and-spread scripts that exploit this vulnerability, to quickly affect vulnerable systems in an unattended manner. Any affected devices that contain private keys should have those keys treated as compromised and have those keys reissued per your company’s information security policies regarding key management procedures.
Next Steps
All financial institutions should immediately review their own environments to determine that no other third-party systems that are involved in serving or securing the online banking experience, or any other publicly-available services, are running vulnerable versions of the Bash shell. Any financial institution that provides any secure services with Linux or *nix variants running a vulnerable version of the Bash shell could be at risk no matter what their vendor mix. If any vulnerable devices are found, they should be treated as suspect and isolated per your incident response procedures until they are validated as not affected or remediated. All financial institutions should immediately and thoroughly review their systems and be prepared to change passwords on and revoke and reissue certificates with private key components stored on any compromised devices.
For further reading on this issue:
When to Ride the Service Bus
One of the great things about adding new, senior talent to a storied team working on a large, complex, and successful enterprise application solution is the critical technical review that results in a lot of “why did/didn’t you do it this way?” questions. You have two options for responding to those questions: ignore or passively dismiss them, or take them seriously as a challenge to prove why you would make the same decision you and your team made 5 years ago if you had to consider it for the first time today, given today’s frameworks, development methodologies, and the current team makeup and skills inventory. If you choose to dismiss these opportunities to critically review your prior decisions, it says a lot about your management style, general appreciation of technology and response to its change, and positions your team to take a reactionary, defensive posture to architecture rather than create a team that honors a proactive, continuous improvement perspective. Far more interesting, too, are those questions that ask why the system is architected in a general way, rather than a theological debate on whether a particular technology component choice is superior to all others or to one’s preferred/familiar choice.
The particular question the new engineer asked was, “Why aren’t we using a service bus?” Instead of answering him directly, I figured this as a good opportunity to explore the previous decision we made that not only did not include an enterprise service bus (ESB) in the original design, but rejected its inclusion when it was strongly suggested by our first customer because they were standardizing on a service bus-centric architecture themselves. The primary advantage of a service bus is to layer an abstraction across heterogeneous systems by implementing a centralized communication mechanism across components. By applying this architectural model, you can get some key benefits including orchestration, queuing to handle intermittent component availability, and extensibility points for message routing to alter dispatch logic or transform messages. Implementing the service bus pattern requires some kind of adapter to be written for each component of the system, either as a local modification to each component or by choosing to standardize on a communication channel provided by the ESB. Even in the latter, usually some minor accommodation is required to allow the ESB to receive and encapsulate the native message for delivery to the destination component. Our first customer was a notable player in the community banking market, and was productizing multiple new SaaS-based web applications that depended on data feeds coming from many different customers. In their scenarios, data was consumed by one application, parsed, and delivered to other applications, which in turn may have created additional data feeds for other products, in a cyclic communication/dependency non-directed graph. 
Each application was developed by different teams, and there was no unified technology stack adoption - some teams were developing on EJB and Flex, others were pure .NET, and teams generally had the discretion to choose whatever they could argue would solve the job, without a strong technology leader looking to unify the stack for similar applications that delivered CMS and pseudo-online banking functionality using a common input data set.
For this customer, ESB was a solution to a problem - their choices led to a highly concurrent development process with multiple independent teams - but also supported connecting a heterogeneous environment of interdependent components, each of which accomplished limited objectives. This organization was running red-hot - developing ancillary products to a highly engaged and fanatic client base of community banks, where their limiting factor was their speed of innovation and delivery. By agreeing on a common communication mechanism that ESB could provide, there was something, albeit low-level, to which all teams agreed. In the ‘controlled agile chaos’ they found themselves in, the abstraction bought them flexibility to adapt to changing business requirements using orchestration. In theory anyway - they ended up moving much slower than they anticipated, but this wasn’t the fault of ESB. ESB solves two classes of problems. The first is the common use case of large, disparate enterprises looking to marry systems established from the dawn of client-server architectures to the newest Node.js hotness, without having to bend the will of any particular system to the communication conventions of any other, which may prove impossible if both systems are proprietary. This is a common use case for BizTalk, especially in the financial sector. All the other benefits you can name off from a service bus architecture are really secondary advantages to this key objective. The second is the use case that any layer of indirection provides: an abstraction you can use to increase the speed of development when requirements are incomplete or prone to pivot. In each case, you invest in a layer to reduce the cost of future change. This particular customer chose NServiceBus as their message-oriented middleware. We seriously evaluated both the general architectural concepts of ESB as well as the particular technology they suggested and came up with a definitive ‘no’ to that choice. 
While it made a lot of sense for our customer, it did not make sense for us because:
- We did not require guaranteed event handling. Our system connected to a system of record that provided transactional consistency, and virtually all state changes were initiated by users through a web browser. A timeout was preferable to a queued command-handling system because of the possibility of duplicate transactions that frustrated users may initiate, not realizing their requests were queued. Second, our interconnected systems did not provide guaranteed event handling, so the guarantee provided by the ESB would not be honored end-to-end. Third, we are using the Windows Identity Foundation with sliding time expirations end-to-end from the user's browser through the lowest layer of service components, which doesn't bode well for delayed delivery situations, even if the user was willing to wait.
- We do require transformation, but not orchestration between our components. Our system features adapter-based design to allow multiple types of endpoints to be serviced by a single service implementation for those portions that may need to connect to a different type of third-party system through a provider model implementation loaded by dependency injection. We could have chosen to use ESB for this piece; however, we perceived the long-term maintenance cost of multiple providers with the party-specific transformation logic to be lower than maintaining those transforms in ESB scripting or adapters. In reviewing this perception today, I believe it was still the right decision because it allowed us to unit-test our transformation logic without including the ESB.
- An ESB is a single point of failure that would independently need to scale for load exponentially proportional to the number of service interconnects in our solution, and would add some amount of latency between each. Because online banking is a mission-critical, customer-facing solution, it cannot have SPOF's in any portion of the architectural design. The SPOF nature of an ESB can be mitigated in multiple ways, but we felt that was at least two layers of complexity we could solve in other, simpler ways.
- All middleware decreases the Mean Time Between Failures (MTBF). This is not a risk specific to ESB, but of any layer added to a system. If you add an ORM, IOC, ESB, or even a logging aspect, something can go wrong with them. Each component has some small, but measurable failure rate, and when inserted into the communication chain between all components, even a reliability of 99.999% still contributes to a reduction in the overall reliability of a serial system. This is where the KISS principle shines - complexity creates unreliability, so all complexity must generate a compelling benefit in excess of its potential to fail.
- We wanted our application layer to be the platform, we did not want ESB to be the platform. This was a business case / competitive advantage decision that we wanted to build as a feature of our system that the same services layer that supported our front-end user interfaces was also an open and extensible platform upon which our clients could integrate to, which would increase the overall value proposition of online banking not only as a sticky end-user experience, but also as a value proposition to capitalize on our solution as the middleware that marries together all the disparate systems within a financial institution, which ultimately online banking does like no other piece of technology within a bank or credit union. We felt that by positioning everything behind an ESB, the perceived value of our technology piece would be lessened without additional client education.
- MSMQ made us feel dirty enough, and we did not want to mandate it for each component because it was in 2009, and still is, relatively difficult to debug, and, as we have lately learned, queues do not work well when used with Layer 7 network load balancing. The new hotness of 0MQ wasn't around then, and while RabbitMQ was, it was arguably not production ready by that time. For us, production-ready isn't just whether a component is capable, but whether it will have general acceptance from the IT departments of our large clients - many newer technologies that are FOSS or from vendors without an established track record require a 'sale' and buy-in during due diligence, long before ink is applied to a contract. Even if they were options for the ESB queuing mechanism, they would not resolve the larger aforementioned concerns.
- At the time we made this choice, AMQP was an amorphous draft that did not solidify until later. The lack of a vendor-independent protocol between components and an ESB made the choice to utilize an ESB subject to vendor lock-in, which we were not willing to tolerate for such a critical component.
- Because our product was both the end-user experience and the middleware we were writing, we felt strongly that the application protocol should provide descriptive metadata and support fast client proxy generation using .NET-based tools. REST support was archaic at best (HttpWebRequest anyone?) in .NET 3.5, and to this day, consuming REST or AMQP services is intrinsically more verbose in C# and VB.NET (HttpClient) than consuming SOAP services, due to a lack of better library and integrated language support for them. Looking back on this, with the large amount of iterative change we went through from ideation to Version 1.0 of our solution, we could not have moved as fast without a fast way to regenerate proxies that would cause build failures to alert us of service operation signature changes -- tracking these down at runtime (REST) or having to debug a secondary system (ESB) to find these would have bogged down our delivery timelines.
- A lesser concern was that, while tracing SOAP messages is definitely more difficult than tracing REST, we felt it would be more difficult still to debug issues in AMQP or other ESB encapsulation protocols than to inspect SOAP envelopes with the built-in WCF tools already present in the .NET development stack.
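The serial-reliability point made in the list above can be put into numbers. This is a minimal sketch, not a measurement of any real system: the three-tier chain, the two extra bus hops, and the 99.999% figure applied uniformly to every component are all assumptions chosen only to illustrate how availability multiplies down a serial chain.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must work for a request to succeed."""
    return prod(component_availabilities)


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per year at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


# Hypothetical chain: web tier -> service tier -> system of record, each at 99.999%.
without_bus = serial_availability([0.99999] * 3)

# Same chain with a bus hop inserted between each pair of components (two extra hops).
with_bus = serial_availability([0.99999] * 5)

print(f"without bus: {without_bus:.6f}, {downtime_minutes_per_year(without_bus):.1f} min/year")
print(f"with bus:    {with_bus:.6f}, {downtime_minutes_per_year(with_bus):.1f} min/year")
```

Every hop added to the chain multiplies in another factor below 1.0, so the combined figure can only go down, however reliable the new component is on its own.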
The central design decision we made was that ESBs provide some great features, but using them ties you into an ESB; if we could get those features another way that was just as convenient or more so, we’d prefer the plug-and-play flexibility of leveraging existing solutions for components such as caching and load balancing in the environment our solution operates in, or picking those pieces ad hoc for each concern, rather than picking the best omnibus solution and working around its specific shortcomings for any one of them. In reviewing the current industry literature and blog posts and looking at general trends, it would seem our decision not to marry our solution to an ESB is generally the path many take when not required to integrate legacy systems as part of an orchestration chain or to use non-HTTP-based transport mechanisms. If you’re using one, hopefully it’s for a good and necessary reason! For us, though, we decided not to hop on a service bus that could take us somewhere we had already arrived.
* As an aside, we actually did end up rolling our own small "ESB" as a TCP port multiplexer that queues and portions out connectivity to a socket-based, legacy third-party component that has no listener back-queue and no port concurrency, highly unusual for a server process. Each connection consumes the port fully for the duration of the short transaction, and we had to write a way to buffer M number of requests and hand them off to (M-N) number of available ports as they became available, in a specialized type of producer-consumer problem. In hindsight, this was an opportunity to use an ESB, but in our case, we only required message routing and load leveling, and in a few hundred lines of code, we implemented what we needed for this particular third-party system, which would have taken us far longer to do as our first time using an ESB. That being said, should we encounter this with another vendor, it would make sense to review using an ESB for this type of functionality in the future.
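The producer-consumer shape of that multiplexer can be sketched in a few lines. This is not the code described above, only an illustration under stated assumptions: the port numbers, the `PortMultiplexer` name, and the `fake_send` stand-in for the socket call are all hypothetical, and a blocking queue of idle ports plays the role of the buffer.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class PortMultiplexer:
    """Hands each request to an idle port; callers wait while every port is busy."""

    def __init__(self, ports, send):
        self._free_ports = queue.Queue()
        for port in ports:
            self._free_ports.put(port)
        self._send = send  # callable(port, request) -> response

    def submit(self, request, timeout=None):
        # Blocks until a port is idle, which is what buffers the excess requests.
        port = self._free_ports.get(timeout=timeout)
        try:
            return self._send(port, request)
        finally:
            # Return the port even if the send failed, so it is not leaked.
            self._free_ports.put(port)


# Stand-in for the legacy component: records any attempt to share a port.
in_use = set()
in_use_lock = threading.Lock()
overlaps = []


def fake_send(port, request):
    with in_use_lock:
        if port in in_use:
            overlaps.append(port)
        in_use.add(port)
    time.sleep(0.001)  # the short transaction that occupies the port
    with in_use_lock:
        in_use.discard(port)
    return f"{request} via {port}"


mux = PortMultiplexer([9001, 9002], fake_send)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(mux.submit, [f"req-{i}" for i in range(10)]))
```

Ten requests from eight concurrent callers are served by two ports, and `overlaps` stays empty because a port is only handed out after the previous holder has returned it.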
Scaling Enterprise Database-Bound Applications: I/O
Optimizing Slow Accesses
While most software developers like to think of themselves as computer scientists in the purest sense of the term, with job duties that would include intimately understanding and exploiting efficiencies of the x64 processor platform, optimizing that critical-path O(log n) algorithm to perform in O(log log n) time, and other acts of mathematical creativity and scientific application, that's not what most software developers do (or should be doing if they are).
Most software developers are building business (retail B2C, B2B API’s, or LOB’s), not scientific applications – and that means most are developing I/O-bound, not CPU-bound applications. Specifically, most business applications are creative user or application programming interfaces around relatively mundane CRUD operations on a data store. Even more complex applications that perform data synchronization or novel calculations of co-variance or multivariate regression consume maybe 5% of their time crunching data, and the other 95% of the time retrieving and sending it on.
So, when you design an enterprise application and get past the ideation phase and start scaling out your next-generation game-changing application from a cute demo to a serious and robust application serving millions of requests, why would you bother with refactoring your string concatenation in loops into string builders, aiming for zero-copy, or optimizing for CPU performance? You should not and you should: You should not be optimizing for CPU performance, unless you have optimized all your slow accesses away – and you should be optimizing for CPU performance because hopefully you’ve already squeezed all the blood out of the I/O turnip you can.
But you haven’t. I know you haven’t. You know you haven’t if you are being honest. Have you ever looked at your database queries per second for specific-entity queries? For instance, let’s say a user logs into your enterprise application, and a service on your application tier needs to retrieve the record of a user. That service might call another service to make a record of the user’s login. Then the user navigates to another page in your application 60 seconds later. How many times did any component of your system retrieve the user by their unique identifier? If the answer is, “I don’t know”, you haven’t scratched the surface of scaling an enterprise application, much less my most important axiom of doing so: “Don’t Repeat Requests”.
This is a lot harder than you might think, because enterprise web application development lends itself to repeating requests, and it is not an easy problem to solve, because you are essentially creating state on an application tier for a web tier that hosts a stateless HTTP application protocol. When functionality is segregated into multiple services with distinct responsibilities, there is some duplication of I/O access that happens to fulfill a request that is unavoidable. Unless you and everyone on your team completely understands this disjoint and works collectively to design solutions that do not repeat requests, you will repeat requests as part of the natural design of any system.
Caching Isn't a Magic Bullet, But It Is a Bullet
If you thought this post was going to end at "implement second-level caching on your ORM of choice", you're wrong, but you should be doing that for sure. This is usually as easy as installing a caching server like Couchbase, configuring your ORM in a few lines of code or configuration files, and voilà - you are still repeating your requests, but this time, answering your repeated requests will be a lot faster than any SSD-backed database server will ever be.
(I say ‘usually’, because this depends on how you’re using your ORM. If you use your ORM as an expensive way to execute stored procedures, your ORM will be at best a pass-through for database methods and will not give you the benefit of entity caching that could be reused for multiple queries that include that entity as a result. As with all caching, YMMV depending on how you have designed your layers.)
Once you enable caching, measure. Measure how many times you ask for that user record when a user logs in and performs some actions over time. You’ll be amazed that when you view this from a database request level, you will still be asking for the same user over and over again as long as not every component is using the cache for database entities with a consistent cache key. It’s very hard to get right, both from an application configuration and a caching server configuration perspective – do not assume, but do measure.
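One low-tech way to start measuring is to count lookups per entity key at the point where they leave your code. The sketch below is illustrative only: `CountingUserStore` and the three login-flow functions are hypothetical names, and in a real system the counts would more likely come from a database profiler or cache dashboard than from a wrapper like this.

```python
from collections import Counter


class CountingUserStore:
    """Wraps a user lookup and counts how often each id is requested."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable(user_id) -> user record
        self.requests = Counter()

    def get_user(self, user_id):
        self.requests[user_id] += 1
        return self._fetch(user_id)


# Stand-in for the real database call.
store = CountingUserStore(lambda user_id: {"id": user_id})


# Hypothetical login flow: three components each look up the same user.
def authenticate(user_id):
    return store.get_user(user_id)


def record_login(user_id):
    return store.get_user(user_id)


def render_dashboard(user_id):
    return store.get_user(user_id)


for step in (authenticate, record_login, render_dashboard):
    step(42)

# Any id requested more than once is a repeated request worth explaining.
repeated = {user_id: count for user_id, count in store.requests.items() if count > 1}
print(repeated)
```

Anything that shows up in `repeated` is a question the system asked more than once during a single user action.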
Remember, the most important thing is not to get really fast answers to your repeated questions, but to stop asking the same questions over and over again! Caching at the ORM is your tourniquet to stop the bleeding of your performance into database I/O buffers and wait times, but caching at the inter-component request level is critical. Let’s say you have an enterprise web application that retrieves a forecast for a city for a given period of time. The web client makes the request for the locale and date range to your application tier, which translates that into queries of whatever entities comprise your data model. With ORM second-level caching in effect, the next request for the same locale and date range will not ask the question of the database this time, but the answer will come instead from the second-level cache… but stop right there. The question was asked again at a higher level, you’re just answering it in a more intelligent way the second time around.
Enterprise web applications need to cache the responses of service requests using a cache key that accounts for the parameters of the request. Hopefully your web application faithfully implements a repository pattern, and if so, you implement a cache into this layer to eliminate repeated requests to the service layer to start with. This is not easy. This is hard because your ORM’s database caching is likely a black box implementation of complex cache expiration logic that performs all sorts of clever tricks to know when an entity has become ‘dirty’ and needs to be retrieved again from the underlying database rather than use the cached copy. If you’re developing business applications, you’re probably not accustomed to being clever at this level, and you will need to spend the time to implement this manually throughout your repository pattern (unless you thought ahead and can add caching as an aspect) and to bust your caches.
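As a rough illustration of caching service responses at the repository layer, the sketch below reuses the forecast example and builds the cache key from the request parameters. Everything named here is an assumption for illustration: the `CachingForecastRepository` class, the in-process dictionary standing in for a cache server, and the fixed time-to-live standing in for real expiration logic.

```python
import time


class CachingForecastRepository:
    """Repository that answers repeated forecast requests without calling the service."""

    def __init__(self, service_call, ttl_seconds=300, clock=time.monotonic):
        self._service_call = service_call  # callable(locale, start_date, end_date)
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key -> (expires_at, value)

    def get_forecast(self, locale, start_date, end_date):
        # The key accounts for every parameter that changes the answer.
        key = ("forecast", locale, start_date, end_date)
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = self._service_call(locale, start_date, end_date)
        self._cache[key] = (now + self._ttl, value)
        return value


# Stand-in for the service tier; it records each call it actually receives.
service_calls = []


def fake_forecast_service(locale, start_date, end_date):
    service_calls.append((locale, start_date, end_date))
    return {"locale": locale, "from": start_date, "to": end_date, "high_f": 75}


repo = CachingForecastRepository(fake_forecast_service)
first = repo.get_forecast("Dallas", "2014-10-01", "2014-10-07")
second = repo.get_forecast("Dallas", "2014-10-01", "2014-10-07")  # served from cache
other = repo.get_forecast("Austin", "2014-10-01", "2014-10-07")   # different key
```

The second Dallas request never reaches the service tier, which is the difference between answering a repeated question quickly and not asking it at all.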
Challenges of Busting Caches
Busting your own caches - that is, invalidating a cached entry when you have reason to know the cached version is no longer good - is one of the trickiest things to get right in this stage of Don't Repeat Requests. Let's take a service method called GetUser() that returns the user and an object graph of some interesting things that cover multiple data entities from the database. At the web tier, we start caching that call when we make it so subsequent calls from the web tier won't even request this from the service while it's in cache. But what else could change the User object in the database? If the user themselves can, then that's easy enough to know to bust a cache on a User repository .Save() method, but if other unrelated processes can, such as say, a back-end service process that bulk-updates users for some reason, then it gets more challenging to ensure you've identified all the paths that could invalidate the data and to make sure each has access to bust the cache for the GetUser() response as cached by the web tier as well as the User entity as represented in any other request (think GetUser(), GetUsersByWhatever(), and all the other variants that may also need cache busting). When GetUser() actually includes data sourced from other entities, you have to think about the dependent object graph in the data model and ensure you've accounted for these as well. You just have to consider but not handle this recursive analysis for deep object graphs -- it only matters as much as it matters for the user experience.
This kind of task must be reserved for the architects and most senior engineers who know your system design and inter-dependencies inside and out to avoid data consistency errors. A key point is as long as all data validation logic is performed at the lowest layer under any custom caching work you perform, data consistency errors will at worst create a poor user experience. 
If you don’t - if you have critical client-side validation that is not mirrored under caching on the service-side of your architecture, you have bigger security risks and other problems than caching, but this will definitely impede your ability to deploy service request caching and scale your application.
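One way to keep track of which cached responses depend on which entities is to tag each cache entry with the entities it contains and bust by tag. This is only one possible approach, not something the discussion above prescribes; the `TaggedCache` class, the key shapes, and the region-based query name are hypothetical and exist only for illustration.

```python
from collections import defaultdict


class TaggedCache:
    """Cache whose entries are tagged with the entities they were built from."""

    def __init__(self):
        self._values = {}
        self._keys_by_tag = defaultdict(set)

    def get(self, key):
        return self._values.get(key)

    def put(self, key, value, tags):
        self._values[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def bust(self, tag):
        """Invalidate every cached response that included the tagged entity."""
        for key in self._keys_by_tag.pop(tag, set()):
            self._values.pop(key, None)


cache = TaggedCache()

# Cached service responses, each tagged with the user entities they contain.
cache.put(("GetUser", 7), {"id": 7, "name": "Ann"}, tags=[("user", 7)])
cache.put(("GetUser", 8), {"id": 8, "name": "Bob"}, tags=[("user", 8)])
cache.put(("GetUsersByRegion", "TX"), [7, 8], tags=[("user", 7), ("user", 8)])

# Any writer, whether a Save() call or a back-end bulk update, busts by entity tag.
cache.bust(("user", 7))
```

After the bust, both `GetUser(7)` and the region query that included user 7 are gone, while `GetUser(8)` is still served from cache. The hard part the text describes remains: every path that writes a user has to know to call `bust`.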
Caching From Within
Within any area of your application, beware anti-patterns that repository patterns can create. If you author MethodA() that calls MethodB() that calls MethodC(), all of which individually call UserRepository.GetUser(), then you're recursively repeating yourself. Repository patterns are nice because they reduce the repetitive session and connection management functions involved with making a web service or database call, but they make it easy to forget that they're very, very heavy methods.
Do not be afraid to accumulate. Do not be afraid to pass object graphs through method parameters to save I/O. You could think about the call stack as your cache here, and while you shouldn’t load it up as an unnecessarily heavy omnibus object to pass around to any method, and while you definitely should not front-load all your I/O before calling a logical method chain before conditional logic or exception management could make some of the calls unnecessary, intelligently design methods so they don’t take the smallest parameter set possible, but create the best scalability when working in concert.
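The MethodA/MethodB/MethodC chain can be sketched both ways to show the difference in I/O. The repository below is a stand-in that only counts calls, and the method bodies are invented for illustration; the point is the call count, not what the methods do.

```python
class UserRepository:
    """Stand-in repository; each call represents a service or database round trip."""

    def __init__(self):
        self.calls = 0

    def get_user(self, user_id):
        self.calls += 1
        return {"id": user_id, "name": "example", "email": "user@example.com"}


repo = UserRepository()


# Anti-pattern: every method in the chain fetches the same user again.
def method_c_refetch(user_id):
    return repo.get_user(user_id)["email"]


def method_b_refetch(user_id):
    user = repo.get_user(user_id)
    return user["name"], method_c_refetch(user_id)


def method_a_refetch(user_id):
    repo.get_user(user_id)
    return method_b_refetch(user_id)


# Alternative: fetch once at the top and pass the object down the call stack.
def method_c(user):
    return user["email"]


def method_b(user):
    return user["name"], method_c(user)


def method_a(user_id):
    user = repo.get_user(user_id)
    return method_b(user)


refetch_result = method_a_refetch(1)
refetch_calls = repo.calls

repo.calls = 0
passed_result = method_a(1)
passed_calls = repo.calls

print(refetch_calls, passed_calls)
```

Both chains return the same result, but the first makes three round trips where the second makes one.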
Caching Outside Your Boundaries
If you're writing enterprise web applications for a product that is not dying or decaying, you're writing it in HTML5 today. And if your web design isn't from a FrontPage 98 template, you're probably using AJAX requests either to improve user experiences and reduce perceived page load times or maybe you've gone whole-hog into an SPA design. With HTML5 and a relatively modern web browser, you have LocalStorage. Use LocalStorage.
You should be using LocalStorage to cache and bust non-error responses to AJAX requests to your web services and REST endpoints. Just because you’ve thinned out the pipes from services to the database and from the web tier to the services tier, why stop there? Why continue to allow browsers to repeat requests to your web tier as a user moves back and forth between areas or pages? If you rest on your laurels on a job-well-done, but still repeat unnecessary I/O queries at a level higher up in the chain, then you’ve made your application more performant but not truly scalable – you’ve just shifted the blame.
The F5 Test
I propose what I will call the "F5 Test" for scalability. When you've cached all you can cache, and every layer is implementing the "Don't Repeat Requests" mantra, open up your database profiler and your Couchbase cache hit dashboard. Log into your application's dashboard, reporting, or whatever page you want to test, then clear your profiler and cache hit counters. Press F5. You should see very, very little activity on a reload, and you should be able to explain what you do see.
But, for what you do see, justify each and don’t make excuses for yourself:
- If your dashboard makes repeated requests because you feel it "always needs to be up-to-date", then you're doing it wrong. Cache and use server-side events to refresh your cached copy.
- If you load a user object to determine whether they have a login session, then do you have a good reason for not using browser evidence such as a signed SAML assertion to validate a session instead of using a database lookup to verify a user exists and is authorized?
- If you see something you can't explain, investigate. I wish this was as obvious as it is intuitive, but many times software developers will be content with an arbitrary improvement (I made 232 database calls on login go down to 47) rather than to do the homework to find out why 47 isn't 5. Maybe there are 42 extraneous requests made by a service that doesn't use the cache even though you thought it did. Maybe one of those 42 requests causes database locking escalations that won't scale with load.
Oh yeah, and optimize query plans. This is important work, but it’s not the outer-most layer of the onion. It’s important to remember the difference between scalability and performance:
- Performance should be determined by the user experience from dispatching of the request to final rendering of the result to the user in their browser. Performance is not "how much CPU does the system use under load" - that is resource utilization, though many people use performance for both concepts.
- Scalability is two-fold: How many users can I serve at a certain level of performance on a certain hardware baseline (scaling up), and can I, and how often will I have to, throw money at more hardware to handle more users at the same level of performance (scaling out)?
- Improving performance may or may not improve scalability
- Improving scalability rarely improves performance
- Management will not understand the difference
It’s work you should do, but you shouldn’t do it first for scalability reasons.
After-Thoughts: Don't Report Stupid Results
Building highly-scalable applications from the ground up with a large team is impossible. You iterate scalability just as you iterate product features. Actually, hopefully you iterate scalability tasks along with user stories, but in actuality, complex enterprise web applications are usually architected with the best of intentions and intelligent designs, yet reach a breaking point at some level of load on some hardware platform that causes a stop-drop-and-roll effort to improve the scaling up and out of an application. Companies with deadlines and tight deliverable schedules don't consistently evaluate and factor into iterations the work required to make and keep an application scalable over time. If someone tells you differently, they're probably in sales and they're definitely lying.
That being said, software developers, do not succumb to the pressure to deliver scalability improvements by reporting true but irrelevant statistics to management.
- "I sped up database calls for GetUser() by 300%!" suggests anything that gets a user should see a three-fold improvement in speed. If that database call is 1% of the login process time, then it will have no material impact.
- "I reduced the size of page requests from 500K to 250K!" means "I doubled the performance or scalability of the application" to management, but in reality, it means neither.
- "I found a problem between ServiceA and ServiceB and cut out three extraneous calls between them!" means nothing to anyone. Did you remove three calls that are made once an hour by a batch process, or three calls made for every user login? What was the impact of those calls on performance and scalability before and after the optimization?
- "ServiceA is a big problem and has a lot of errors. I removed a lot of exceptions on ServiceA. Exceptions cause performance problems." is problematic on several levels. Why were the exceptions being thrown? Did removing them fix or just sweep a real problem under the carpet? If it was justified, what improvement did it have on the overall system?
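The GetUser() example above is an instance of Amdahl's law, and the arithmetic is worth doing before any status report. A small sketch, reading "300%" the way management would (three times as fast) and taking the 1% share of login time from the example:

```python
def overall_speedup(fraction_of_total: float, local_speedup: float) -> float:
    """Amdahl's law: whole-process speedup when only one part gets faster."""
    if not 0.0 <= fraction_of_total <= 1.0 or local_speedup <= 0.0:
        raise ValueError("fraction must be in [0, 1] and speedup must be positive")
    return 1.0 / ((1.0 - fraction_of_total) + fraction_of_total / local_speedup)


# GetUser() is 1% of login time and becomes 3x as fast.
login_gain = overall_speedup(0.01, 3.0)
print(f"login is {login_gain:.4f}x as fast")  # prints "login is 1.0067x as fast"
```

A three-fold improvement to one call yields well under a 1% improvement to login, which is the number management actually cares about.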
While most devs don’t do scientific computing, scaling applications is an empirical task that demands meaningful measurement in a realistic testing context. There is no spec document or product owner guidance on improving scalability: you must treat it as a scientific experiment. Observe, hypothesize, have a control (the pre-change measurement), experiment, report data. If you fail to discretely value each change with before and after metrics, you’re just shooting in the dark. Cowboy coding gets teams into scalability messes, not out of them.
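As a rough sketch of what "before and after metrics" can look like in practice, the helper below compares two sets of response-time samples taken under the same test load. The function name and the choice of the median as the summary statistic are my assumptions; use whatever statistic matches your performance targets.

```python
from statistics import median
from typing import Dict, Sequence


def compare_runs(before_ms: Sequence[float],
                 after_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize a control run and a post-change run of the same test."""
    if not before_ms or not after_ms:
        raise ValueError("need at least one sample in each run")
    base, changed = median(before_ms), median(after_ms)
    return {
        "before_median_ms": base,
        "after_median_ms": changed,
        # Negative means faster; near zero means the change did not matter.
        "change_percent": (changed - base) / base * 100.0,
    }
```

If the change percentage is within the noise of repeated control runs, the honest report is "no observable improvement", however clever the fix was.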
Especially, though, don’t give updates on enhancements that you cannot verify improve scalability with before and after numbers. If you fix a problem that doesn’t improve the overall system scalability, which happens often in scalability improvement iterations, highlighting your accomplishments when there is no observable improvement suggests you are either ineffective or not working on the right items. Worse, in crunch times, providing such updates gives a false sense of accomplishment to management. Improving scalability, or performance for that matter, has no done-state. But providing meaningless accomplishment notes to management will accelerate the sense of “we’re done enough”, when in fact, you may not have even identified the most significant issue to your scalability for your particular scenario.
And if you haven’t, let me do it for you: You’re repeating your requests. Trust me on that one. :-)
A Brief Introduction to Part-of-Speech Tagging
A field of computer science that has captured my attention lately is computational linguistics – the inexact science of how to get a computer to understand what you mean. This could be something as futuristic as Matthew Broderick’s battle with the WOPR, or with something more practical, like Siri. Whether it be text entered by a human into a keyboard or something more akin to understanding the very unstructured format of human speech, understanding the meaning behind parsed words is incredibly complex – and to someone like me – fascinating!
My particular interest as of late is parsing – which from a linguistic perspective, means the breaking down of a string of characters into words, their meanings, and stringing them together in a parse tree, where the meanings of individual words as well as the relationships between words are composed into a logical construct that allows higher order functions, such as a personal assistant. Having taken several foreign language classes before, then sitting on the other side of the table as an ESL teacher, I can appreciate the enormous ambiguity and complexity of any language, and much more so English among Germanic languages, when it comes to creating an automated process to parse input into meaningful logical representations. Just being able to discern the meaning of individual words given the multitude of meanings that can be ascribed to any one sequence of characters is quite a challenge.
Parsing Models
Consider this: “My security beat wore me out tonight.” In this sentence, what is the function of the word beat? Beat can function as either a noun or a verb, but in this context, it is a noun. There are two general schools of thought around assigning a tag for the part of speech (POS) each word in a sentence functions as – iterative rules-based methods and stochastic methods. Rules-based methods, like Eric Brill’s POS tagger, use a priority-based set of rules that set forth language-specific axioms, such as “when a word appears to be a preposition, it is actually a noun if the preceding word is while”. A complex set of these meticulously constructed conditions is used to refine a coarser dictionary-style assignment of POS tags.
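The two-pass idea behind a rules-based tagger can be sketched in a few lines. This is only a toy in the spirit of that approach: the lexicon and the single hand-written rule below are my own assumptions, whereas a real Brill tagger learns its rules from a tagged corpus.

```python
from typing import Dict, List, Set, Tuple

# Toy lexicon: each word's most frequent tag (Penn Treebank style).
# Hypothetical data for illustration only.
MOST_LIKELY_TAG: Dict[str, str] = {
    "my": "PRP$", "security": "NN", "beat": "VB", "wore": "VBD",
    "me": "PRP", "out": "RP", "tonight": "NN",
}

# Contextual rules in priority order:
# (tag to change, replacement tag, previous tags that trigger the change).
RULES: List[Tuple[str, str, Set[str]]] = [
    ("VB", "NN", {"NN", "DT", "PRP$", "JJ"}),
]


def tag(words: List[str]) -> List[Tuple[str, str]]:
    """Assign dictionary tags first, then refine them with contextual rules."""
    tags = [MOST_LIKELY_TAG.get(word.lower(), "NN") for word in words]
    for from_tag, to_tag, previous_tags in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] in previous_tags:
                tags[i] = to_tag
    return list(zip(words, tags))


print(tag("My security beat wore me out tonight".split()))
```

Here the dictionary pass guesses that beat is a verb, and the contextual rule corrects it to a noun because the preceding word was tagged as a noun.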
This continues to be a good candidate for doctoral theses in computer science disciplines, papers that have caused me to lose too much sleep as of late.
Parsing Syntax
Even describing parts of speech can be as mundane as your elementary school grammar book, or as rich as the C7 tagset, which provides 146 unique ways to describe a word's potential function. While the C7 tagset is exceptionally expressive and specific, I have become rather fond of the Penn Treebank II tagset, which defines 45 tags that seem to provide enough semantic context for the key elements of local pronoun resolution and larger-scale object-entity context mapping. Finding an extensively tagged Penn Treebank corpus proves difficult, however, as it is copyrighted by the University of Pennsylvania, distributed through a public-private partnership for several thousand dollars, and the tagged corpus is almost exclusively a narrow variety of topics and sentence structures -- Wall Street Journal articles. Obtaining it is critical as a reference check when writing a new Penn Treebank II part-of-speech tagger, and its limited availability prevents the construction of a more comprehensive Penn-tagged wordlist, which would be a boon for any tagger implementation. However, the folks at the NLTK have provided a 10% free sample under Fair Use that has proved somewhat useful both for checking outputs in a limited fashion and for generating some more useful relative statistics about relationships between parts of speech within a sentence.
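The "relative statistics about relationships between parts of speech" I have in mind are things like tag-to-tag transition frequencies. A minimal sketch, using a tiny hand-made sample rather than real corpus data (the sample sentences and the "<s>" start marker are assumptions; with NLTK installed and its treebank sample downloaded, nltk.corpus.treebank.tagged_sents() should be usable in place of SAMPLE):

```python
from collections import Counter
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]

# Hypothetical stand-in for a tagged corpus sample.
SAMPLE: List[TaggedSentence] = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("old", "JJ"), ("cat", "NN"), ("sleeps", "VBZ")],
]


def tag_transition_probabilities(
        corpus: List[TaggedSentence]) -> Dict[Tuple[str, str], float]:
    """Estimate P(current tag | previous tag) from adjacent tag pairs."""
    pair_counts: Counter = Counter()
    previous_counts: Counter = Counter()
    for sentence in corpus:
        tags = ["<s>"] + [tag for _, tag in sentence]  # "<s>" marks sentence start
        for previous, current in zip(tags, tags[1:]):
            pair_counts[(previous, current)] += 1
            previous_counts[previous] += 1
    return {pair: count / previous_counts[pair[0]]
            for pair, count in pair_counts.items()}


probabilities = tag_transition_probabilities(SAMPLE)
```

In this sample a determiner is followed by a noun half the time and by an adjective the other half, which is the kind of number a stochastic tagger leans on when a word's own tag is ambiguous.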
Future Musings
My immediate interest, whenever I get some free time on a weekend (which is pretty rare these days due to the exceptional pace of progress at our start-up), is pronoun resolution, which is the object of this generation's Turing Test -- the Winograd Schemas. An example of such a challenge is to get a machine to answer this kind of question -- Joe's uncle can still beat him at tennis, even though he is 30 years older. Who is older? This kind of question is easy for a human to answer, but very, very hard for a machine to infer because (a) it can't cheat to Google a suitable answer, which some of the less impressive Turing Test contestant programs now do, and (b) it requires not only the ability to successfully parse a sentence into its respective parts of speech, phrases, and clauses, but it requires the ability for a computer to resolve the meaning of a pronoun. That's an insanely tough feat! Imagine this: “Annabelle is a mean-spirited person. She shot my dog out of spite.”
A program could infer “my dog” is a dog belonging to the person providing the text. This has obvious applications in the real world if you can do this, and it has been done before. But, imagine the leap in context that is exponentially harder to overcome when resolving “She”. This requires not only an intra-sentence relationship of noun phrases, possessive pronouns, direct objects, and adverbial clauses, but it also requires the ability to carry context forward from one sentence to the next, building an ongoing “mental map” of people, places, things – and building a profile of them as more information or context is provided. And, if you think that’s not hard enough to define .. imagine the two additional words appended on to this sentence:
, she said.
That would to a human indicate dialog, which requires a wholly separate frame of Inception-style reference between contextual frames. The parser is reading text about things which is actually being conveyed by other things – both sets of frames have their own unique, but not necessarily separate, domains and attributes. I’m a very long way off from ever getting this diversion in my “free time” anywhere close to functioning as advertised… but, then again, that’s what exercises on a weekend are for – not doing, but learning. :)
Robustness in Programming
(For my regular readers, I know I promised this post would detail ‘a method by which anyone could send me a message securely, without knowing anything else about me other than my e-mail address, in a way I could read online or my mobile device, in a way that no one can subpoena or snoop on in between.’ A tall order, for sure, but still something I am working to complete in an RFC format. In the meantime…)
I have the benefit of supporting an engineering group that is seeing tremendous change and growth, well past ideation and proof of concept and into the validation and scaling phases of a product timeline. One observation I’ve made among the many lessons taught and learned as part of this company and product growth spurt has been the misapplication of Jon Postel’s Robustness Principle. Many technical folks are at least familiar with, and often can quote, the adage: “Be conservative in what you do, be liberal in what you accept from others”. Unfortunately, like many good pieces of advice, this is taken out of context when it relates to software development.
First off, robustness, while it sounds positive, is not a trait you always want. This can be confusing for the uninitiated, considering antonyms of the word include “unfitness” and “weakness”. On a macro-scale, you want a system to be robust; you want a product to be robust. However, if you decompose an enterprise software solution into its components, and those pieces into their individual parts, the concerns do not always need to, and in some cases should not, be robust.
For instance, should a security audit log be robust? Imagine a highly secure software application that must carefully log each access attempt to the system. This system is probably designed so that many different components of the system can write data to this log, and imagine the logging system is simple and writes its output to a file. If this particular part of the system were robust, as many developers define it, it must, as well as possible, attempt to accept and log any messages posted to it. However, implemented this way, it is subject to CRLF injection attacks, whereby a component that can connect to it can insert a delimiter that allows it to add false entries to the security log. Of course, you developers say, you need to do input checking and not allow such a condition to pass through to the log. I would go much further and state you must be as meticulous as possible about parsing and throwing exceptions or raising errors for as many conditions as possible. Each exception that is not thrown is an implicit assumption, and assumptions are the root cause of 9 out of the OWASP Top 10 vulnerabilities in web applications.
Robustness can be, and often is, an excuse predicated on laziness. Thinking about edge cases and about the assumptions software developers make with each method they write is tedious. It is time consuming. It does not advance a user story along its path in an iteration. It adds no movement towards delivering functionality to your end users. Recognizing and mitigating your incorrect assumptions, however, is an undocumented but critical requirement for the development of every piece of a system that does store, or may ever come in contact with, protected information. Those that rely on the Robustness Principle must not interpret “liberal” to mean “passive” or “permissive”, but rather “extensible”.
In the logging system example I posited earlier, consider how such a system could remove assumptions but still be extensible. The number and format of each argument that comprises a log entry should be carefully inspected - if auditing text must be descriptive, then shouldn’t such a system reject a zero or two-character event description? While information systems should be localizable and multilingual, shouldn’t all logs be written in one language, with any characters that are not of that language omitted and unique system identifiers within the log language's character set used instead? If various elements are co-related, such as an account number and a username, shouldn’t they be checked for an association instead of blindly accepting them as stated by the caller? If the log should be chronological, shouldn’t an event specified in the future or too far in the past be rejected? Each of these leading questions exposes a vulnerability a careful assessment of input checking can address, but which is wholly against most developers' interpretations of the Robustness Principle.
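To show how little code these checks take, here is a minimal sketch of a validating front door for such an audit log. Every name and threshold in it (the entry fields, the minimum description length, the allowed clock skew and age, the account lookup table) is a hypothetical choice for illustration, not a recommendation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional, Set

MIN_DESCRIPTION_LENGTH = 10              # hypothetical threshold
MAX_CLOCK_SKEW = timedelta(seconds=30)   # hypothetical tolerance for "future"
MAX_AGE = timedelta(hours=1)             # hypothetical tolerance for "past"


class InvalidLogEntry(ValueError):
    """Raised for every entry that fails a check; nothing is silently fixed."""


@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime  # expected to be timezone-aware
    username: str
    account_number: str
    description: str


def validate_entry(entry: AuditEntry,
                   accounts_by_user: Mapping[str, Set[str]],
                   now: Optional[datetime] = None) -> AuditEntry:
    """Return the entry unchanged if it passes every check, else raise."""
    current = now if now is not None else datetime.now(timezone.utc)
    for field_name in ("username", "account_number", "description"):
        # isprintable() is False for CR, LF, and other control characters,
        # which blocks forged extra lines in a line-oriented log file.
        if not getattr(entry, field_name).isprintable():
            raise InvalidLogEntry(f"{field_name} contains control characters")
    if len(entry.description.strip()) < MIN_DESCRIPTION_LENGTH:
        raise InvalidLogEntry("description is too short to be descriptive")
    if entry.account_number not in accounts_by_user.get(entry.username, set()):
        raise InvalidLogEntry("account is not associated with this user")
    if entry.timestamp > current + MAX_CLOCK_SKEW:
        raise InvalidLogEntry("event is in the future")
    if entry.timestamp < current - MAX_AGE:
        raise InvalidLogEntry("event is too far in the past")
    return entry
```

Each raise is an assumption made explicit: a caller that trips one learns immediately, instead of the log quietly accepting something an auditor will later have to explain.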
However, robustness is not about taking whatever is given to you; it is about very carefully checking what you get, and if and only if it passes a litany of qualifying checks, accepting it as an answer to an open-ended question, rather than relying on a defined set of responses, when possible. A junior developer might enumerate all the error states he or she can imagine in a set list or “enum”, and only accept a value from that list as valid input to a method. While that’s a form of input checking, it is wholly inextensible, as the next error state any other contributor wishes to add will require a recompile/redeploy of the logging piece, and potentially every other consumer of that component. Robustness need not require all data be free-form; it must simply be written with foresight.
Postel wrote his “law” with reference to TCP implementations, but he never suggested that TCP stack implementers liberally accept TCP segments with such boundless blitheness that they infer the syntax of whatever bits they receive. Rather, they should not impose an understanding of data elements that are not pertinent to the task at hand, nor enforce one specific interpretation of a specification upon upstream callers. And therein lies my second point – robustness is not about disregarding syntax, but about imposing a convention. Robust systems must fail as early and as quickly as possible when syntax, especially, has been violated or cannot be accurately and unambiguously interpreted, or if the context or state of a system is deemed to be invalid for the operation. For instance, if a system receives a syntactically valid message but can determine the context is wrong, such as a request for information from a user who lacks an authorization to that data, it should reject the message. Every conceivable permutation of invalid context should be checked, rather than skipped in a blasé fashion to leave room for a future feature that may, someday, require an assumption made in the present, if it is ever to be developed. Skipping those checks crosses another threshold, beyond extensibility, into culpable disregard.
In conclusion, building a robust system requires discretion in interpretation of programming “laws” and “axioms”, and an expert realization that no one-liner assertions were meant by their authors as principles so general as to apply to every level of technical scale of the architecture and design of a system. To those who would disagree with me, I would say, then to be “robust” yourself, you have to accept my argument. ;)
When All You See Are Clouds... A Storm Is Brewing
The recent disclosures that the United States Government has violated the 4th amendment of the U. S. Constitution and potentially other international law by building a clandestine program that provides G-Men at the NSA direct taps into every aspect of our digital life - our e-mail, our photos, our phone calls, our entire relationships with other people and even with our spouses, is quite concerning from a technology policy perspective. The fact that the US Government (USG) can by legal authority usurp any part of our recorded life - which is about every moment of our day - highlights several important points to consider:
- Putting the issue of whether the USG/NSA should have broad access into our lives aside, we must accept that the loopholes that allow them to demand this access expose weaknesses in our technology.
- The fact the USG can perform this type of surveillance indicates other foreign governments and non-government organizations likely can and may already be doing so as well.
- Given that governments are often less technologically savvy though much more resource-rich than malevolent actors, if data is not secure from government access, it is most definitely not secure from more cunning hackers, identity thieves, and other criminal enterprises.
But before proposing some solutions, we must consider:
How Could PRISM Have Happened in the First Place?
I posit an answer devoid of politics or blame, but on an evaluation of the present state of Internet connectivity and e-commerce. Arguably, the Internet has matured into a stable, reliable set of services. The more exciting phase of its development saw a flourishing of ideas much like a digital Cambrian explosion. In its awkward adolescence, connecting to the Internet was akin to performing a complicated rain dance that involved WinSock, dial-up modems, and PPP, sprinkled with roadblocks like busy signals, routine server downtime, and blue screens of death. The rate of change in equipment, protocols, and software was meteoric, and while the World Wide Web existed (what most laypeople consider wholly as “the Internet” today), it was only a small fraction of the myriad of services and channels for information to flow. Connecting to and using the Internet required highly specialized knowledge, which both increased the level of expertise of those developing for and consuming the Internet, while limiting its adoption and appeal - a fact some consider the net’s Golden Age.
But as with all complex technologies, eventually they mature. The rate of innovation slows down as standardization becomes the driving technological force, pushed by market forces. As less popular protocols and methods of exchanging information gave way to young but profitable enterprises that pushed preferred technologies, the Internet became a much more homogeneous experience, both in how we connect to it and in how we interact with it. This not only shaped the fate of now-obsolete tech, such as UUCP, FINGER, ARCHIE, GOPHER, and a slew of other relics of our digital past, but also influenced the very design of what remains – a great example being identification and encryption.
For the Internet to become a commercializable venue, securing access to money, from online banking to investment portfolio management, to payments, was an essential hurdle to overcome. The solution for the general problem of identity and encryption, centralized SSL certificate authorities providing assurances of trust in a top-down manner, solves the problem specifically for central server webmasters, but not for end-users wishing to enjoy the same access to identity management and encryption technology. So while the beneficiaries like Amazon, eBay, PayPal, and company now had a solution that provided assurance to their users that you could trust their websites belonged to them and that data you exchanged with them was secure, end-users were still left with no ability to control secure communications or identify themselves with each other.
A final contributing factor I want to point out is that, as other protocols drifted into oblivion, more functionality was demanded over a more uniform channel – the de facto winner becoming HTTP and the web. Originally a stateless protocol designed for minimal browsing features, the web became a solution for virtually everything, from e-mail (“webmail”), to searching, to file storage (who has even fired up an FTP client in the last year?). This was a big win for service providers, as companies like Yahoo! and later Google could build entire product suites on just one delivery platform, HTTP, but it was also a big win for consumers, who could throw away all their odd little programs that performed specific tasks, and could just use their web browser for everything – now even Grandma can get involved. A richer offering of single-shot tech companies was bought up or died out in favor of the oligarchs we know today - Microsoft, Facebook, Google, Twitter, and the like.
Subtly, this also represented a huge shift in where data is stored. Remember Eudora or your Outlook inbox file tied to your computer (in the days of POP3 before IMAP was around)? As our web browser became our interface to the online world, and as we demanded anywhere-accessibility to those services and the data they create or consume, those bits moved off our hard drives and into the nebulous service provider cloud, where data security cannot be guaranteed.
This is meaningful to consider in the context of today’s problem because:
- Governments and corporate enterprises were historically unable to sufficiently regulate, censor, or monitor the internet because they lacked the tools and knowledge to do so. Thus, the Internet had security through obscurity.
- Due to the solutions to general problems around identity and encryption relying on central authorities, malefactors (unscrupulous governments and hackers alike) have fewer targets to influence or assert control over to tap into the nature of trust, identity, and communications.
- With the collapse of service providers into a handful of powerful actors on a scale of inequity on par with a collapse of wealth distribution in America, there now exist fewer providers to surveil to gather data, and those providers host more data on each person or business that can be interrelated in a more meaningful way.
- As information infrastructure technology has matured to provide virtual servers and IaaS offerings on a massive scale, fewer users and companies deploy controlled devices and servers, opting instead to lease services from cloud providers or use devices, like smartphones, that wholly depend upon them.
- Because data has migrated off our local storage devices to the cloud, end-users have lost control over their data's security. Users have to choose between an outmoded device-specific way to access their data, or give up the control to cloud service providers.
There Is A Better Way
As a good example, if you want to send a secure e-mail message today, you have three categorical options to do so:
- Implicitly trust a regular service provider: Ensure both the sender and the receiver use the same server. A message sent this way is only at risk while the sender connects to the provider to store it and while the receiver connects to the provider to retrieve it. Both parties trust the service provider will not access or share the information. Of course, many actors, like Gmail, still do.
- Use a secure webmail provider: These providers, like Voltage.com, encrypt the sender's connection to the service to protect the message as it is sent, and send notifications to receivers to come to a secure HTTPS site to view the message. While better than the first option, the message is still stored in a way that can be demanded by subpoena or snooped inside the company while it sits on their servers.
- Use S/MIME certificates and an offline mail client: While the most secure option for end-to-end message encryption, this cumbersome method is machine-dependent and requires senders and receivers to first share a certificate with each other - something the average user is flatly incapable of understanding or configuring.
Doing Your Due Diligence on Security Scanning and Penetration Testing Vendors
All too often, development shops and IT professionals become complacent with depending on packaged scanning solutions or a utility belt of tools to provide security assurance testing of a hosted software solution. In the past five years, a number of new entrants to the security evaluation and penetration testing market have created some compelling cloud-based solutions to perimeter testing. These tools, while exceptionally useful for a sanity check of firewall rules, load balancer configurations, and even certain industry best practices in web application development, are starting to create a false sense of security in a number of ways. As these tools proliferate, infrastructure professionals are becoming increasingly dependent upon their handsomely-crafted reporting about PCI, GLBA, SOX, HIPAA, and all the other regulatory buzzwords that apply to certain industries. If you’re using these tools, have you considered:
Do you use more than one tool? If not, and you should, is there any actual overlap between their testing criteria?
There is a certain incestuous phenomenon that develops in any SaaS industry that sees high profit margins: entrepreneurs perceive cloud-based solutions as having a low barrier to entry. This perception drives new market entrants to cobble together solutions to compete for share in the space. But are these fly-by-night competitors competitively differentiated from their peers?
Sadly, I have found in practical experience this not to be the case. Too many times have I enrolled in a free trial of a tool, or actually shelled out for some hot new cloud-based scanning solution, only to find that at best existing known vulnerabilities are duplicatively reported by this new ‘solution’, with only false positives appearing as the ‘net new’ items to bring to my attention. Herein lies the rub – when new entrants to this market create competing products, there is an iterative reverse engineering that goes on – they run existing scanning products on the market against websites, check those results, and make sure they develop a solution that at least identifies the same issues.
That’s not good at all. In any given security scan, you may see, perhaps, 20% of the total vulnerabilities a product is capable of finding show up as a problem in a scan target. Even if you were to scan multiple targets, you may only be seeing mostly the same kinds of issues in each subsequent scan. Those using this as a methodology to build quick-to-market security scanning solutions are delivering sub-par offerings that may only identify 70% of the vulnerabilities other scanning solutions do. eEye has put together similar findings in an intriguing report I highly recommend reading. Investigating the research and development activities of a security scanning provider is an important due diligence step to make sure when you get an “all clear” clean report from a scanning tool, that report actually means something.
How do you judge your security vendor in this regard? Ask for a listing of all specific vulnerabilities they scan for. Excellent players in this market will not flinch at giving you this kind of data for two reasons: (1) a list of what they check for isn’t as important as how well and how thoroughly they actually assess each item, and (2) worthwhile vendors are constantly adding new items to the list, so it doesn’t represent any static master blueprint for their product.
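Once two vendors have handed over their check listings, answering the overlap question is simple set arithmetic. A small sketch, assuming each listing can be reduced to a collection of comparable check identifiers (the function name, the identifiers in the usage note, and the use of Jaccard similarity as the summary number are my assumptions):

```python
from typing import Dict, Iterable, List, Union


def coverage_overlap(checks_a: Iterable[str],
                     checks_b: Iterable[str]) -> Dict[str, Union[List[str], float]]:
    """Compare two scanners' check listings by identifier."""
    a, b = set(checks_a), set(checks_b)
    union = a | b
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Jaccard similarity: 1.0 means identical lists, 0.0 means disjoint.
        "jaccard": len(a & b) / len(union) if union else 0.0,
    }
```

For example, coverage_overlap(["sqli", "xss", "path-traversal"], ["sqli", "xss", "weak-tls"]) reports two shared checks and one unique check per tool. A similarity close to 1.0 suggests the second tool is mostly paying twice for the same coverage.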
Does your tool test more than OWASP vulnerabilities?
The problem with developing security testing tools is in part the over-reliance on the standardization of vulnerability definition and classifications. While it is helpful to categorize vulnerabilities into conceptually similar groups to create common mitigation strategies and techniques, too often security vendors focus on OWASP attack classifications as the definitive scope for probative activities. Don't get me wrong, these are excellent guides for ensuring the most common types of attacks are covered, but they do not provide a comprehensive test of application security. Too often the types of testing such as incremental information disclosure, where various pieces of the system provide information that can be used to discern how to attack the system further, are relegated to manual penetration testing instead of codified into scanning criteria. Path disclosure and path traversal vulnerabilities are a class of incremental information disclosures that are routinely tested for by scanning tools, but they represent only a file-system basis test for this kind of security problem instead of part of a larger approach to the problem through systematic scanning.
Moreover, SaaS providers should consider DoS/DDoS weaknesses as security problems, not just customer relationship or business continuity problems. These types of attacks can cripple a provider and draw their technical talent to the problem at hand, mitigating the denial of service attack. Such attacks can be, and recently have been, used in high-profile fake-outs either to generate so much trash traffic that other attacks and penetrations are difficult to perceive or react to, or to create opportunities for social engineering attacks to succeed with less sophisticated personnel while the big-guns are trying to tackle the bigger attacks.
Until weaknesses that can allow for high-load to easily take down a SaaS application are included as part of vulnerability scanning, this will remain a serious hole in the testing methodology of a security scanning vendor.
So, seeing CVE identifiers and OWASP classifications for reported items is nice from a reporting perspective, and it gives a certain credence to mitigation reports to auditors, but don’t let those lull you into a false sense of security coverage. Ask your vendor what other types of weaknesses and application vulnerabilities they test for outside of the prescribed standard vulnerability classifications. Otherwise, you will potentially shield yourself from “script kiddies”, but leave yourself open to targeted attacks and advanced persistent threats that have created embarrassing situations for a number of large institutions in the past year.
What is your mobile strategy?
Native mobile applications are the hot stuff right now. Purists tout the HTML5-only route to mobile application development, but mobile web development alone isn't enough to satisfy Apple to get access to the iOS platform (since 2008), and consumers can still detect a web app that is merely a browser window and prefer the feature set that comes from native applications, including camera access, accelerometer data, and integration of the physical phone buttons into application navigation. The native experience is still too nice to pass up if you want to be at the head of the class in your industry.
If you’re a serious player in the SaaS market, you have or will soon have a native mobile application or hybrid-native deliverable. If you’re like most other software development shops, mobile isn’t your forte, but you’ve probably hired specific talent with a mobile skill set to realize whatever your native strategy is. Are your architects and in-house security professionals giving the same critical eye to native architecture, development, and code review as they are to your web offering? If you’re honest, the answer is: probably not.
The reason your answer is ‘probably not’ is that it is a whole different technology stack, set of development languages, and testing methodology, and the tools you invested in to secure your web application do not apply to your native application development. This doesn’t mean your native applications are not vulnerable; it means they’re vulnerable in different ways that you don’t even know about or test for yet. This should be a wake-up call for enterprise software shops: the fact that a vulnerability exists only on a native platform does not mitigate its seriousness. It is trivial to spin up a mobile emulator to host a native application and use the power of a desktop or server to exploit that vulnerability on a scale that could cripple a business through disclosure or denial of service.
Your native mobile security scanning strategy should minimally cover two important surface areas:
- Vulnerabilities in the way the application stores data on the device in memory and on any removable media
- Vulnerabilities in the underlying API serving the native application
If you’re not considering these, then you probably have not selected a native application security scanning tool that checks for them either.
In Conclusion
Security is always a moving target, as fluid as the adaptiveness of attackers' techniques and the rapid pace of change in the technologies they attack. Don't treat security scanning and penetration testing as a checklist item for RFPs or to address auditors' concerns -- understand the surface areas, and understand the failings of security vendors' products. Understand that your assessments are valid only in the short term, and that re-evaluating your vendor mix and their offerings on a continual basis is crucial. Only then will you be informed and able to make the right decisions to be proactive, instead of reactive, regarding the sustainability of your business.
Thwarting SSL Inspection Proxies
A disturbing trend in corporate IT departments everywhere is the introduction of SSL inspection proxies. This blog post explores some of the ethical concerns about such proxies and proposes a provider-side technology solution to allow clients to detect their presence and alert end-users. If you're well-versed in concepts about HTTPS, SSL/TLS, and PKI, please skip down to the section entitled 'Proposal'.
For starters, e-commerce and many other uses of the public Internet are only possible because the capability to encrypt messages exists. The encryption of information across the World Wide Web is possible through a suite of cryptography technologies and practices known as Public Key Infrastructure (PKI). Using PKI, servers can offer a "secure" variant of the HTTP protocol, abbreviated as HTTPS. This variant encapsulates application-level protocols, like HTTP, using a transport-layer protocol called Secure Sockets Layer (SSL), which has since been superseded by a similar, more secure protocol, Transport Layer Security (TLS). Most users of the Internet are familiar with the symbolism common with such secure connections: when a user browses a webpage over HTTPS, some visual iconography (usually a padlock) as well as a stark change in the presentation of the page's location (usually a green indicator) shows the end-user that the page was transmitted over HTTPS.
SSL/TLS connections are protected in part by a server certificate stored on the web server. Website operators purchase these server certificates from a small number of competing companies, called Certificate Authorities (CA's), that can generate them. The web browsers we all use are preconfigured to trust certificates that are "signed" by a CA. The way certificates work in PKI allows certain certificates to sign, or vouch for, other certificates. For example, when you visit Facebook.com, you see your connection is secure, and if you inspect the message, you can see the server certificate Facebook presents is trusted because it is signed by VeriSign, and VeriSign is a CA that your browser trusts to sign certificates.
So... what is an SSL Inspection Proxy? Well, there is a long history of employers and other entities using technology to do surveillance of the networks they own. Most workplace Internet Acceptable Use Policies state clearly that the use of the Internet using company-owned machines and company-paid bandwidth is permitted only for business use, and that the company reserves the right to enforce this policy by monitoring this use. While employers can easily review and log all unencrypted traffic that flows over their networks, that is, any request for a webpage and the returned rendered output, the increasing prevalence of HTTPS as a default has frustrated employers in recent years. Instead of being able to easily monitor the traffic that traverses their networks, they have had to resort to less-specific ways to infer usage of secure sites, such as DNS recording.
(For those unaware and curious, the domain-name system (DNS) allows client computers to resolve a URL's name, such as Yahoo.com, to its IP address, 72.30.38.140. DNS traffic is not encrypted, so a network operator can review the requests of any computers to translate these names to IP addresses to infer where they are going. This is a poor way to survey user activity, however, because many applications and web browsers do something called "DNS pre-caching", where they will look up name-to-number translations in advance to quickly service user requests, even if the user hasn't visited the site before. For instance, if I visited a page that had a link to Playboy.com, even if I never click the link, Google Chrome may look up that IP address translation just in case I ever do in order to look up the page faster.)
So, employers and other network operators are turning to technologies that range from the ethically questionable, such as Deep Packet Inspection (DPI), which looks into all the application traffic you send to determine what you might be doing, to the downright unethical practice of using SSL Inspection Proxies. Now, I concede I have an opinion here: that SSL Inspection Proxies are evil. I justify that assertion because an SSL Inspection Proxy causes your web browser to lie to its end-user, giving them a false assertion of security.
What exactly are SSL Inspection Proxies? SSL Inspection Proxies are servers set up to execute a Man-In-The-Middle (MITM) attack on a secure connection, on behalf of your ISP or corporate IT department snoops. When such a proxy exists on your network and you make a secure request for [www.google.com](https://www.google.com), the network redirects your request to the proxy. The proxy then makes a request to [www.google.com](https://www.google.com) for you, returns the results, and then does something very dirty -- it creates a lie in the form of a bogus server certificate. The proxy will create a false certificate for www.google.com, sign it with a different CA it has in its software, and hand the response back. This "lie" happens in two manners:
- The proxy presents itself as the server you request, instead of the actual server you requested.
- The proxy states the certificate handed back with the page response is a different one than what was actually handed back by that provider, www.google.com in this case.
This interchange would look like this:
It sounds strange to phrase the activities of your own network as an "attack", but this type of interaction is precisely that, and it is widely known in the network security industry as a MITM attack. As you can see, a different certificate is handed back to the end-user's browser than the one www.example.com actually presented in the above image. Why? Well, each server certificate that is presented with a response is used to encrypt that data. Server certificates have what is called a "public key" that everyone can know and which uniquely identifies the certificate, and they also have a "private key", known only by the web server in this example. A public key can be used to encrypt information, but only a private key can decrypt it. Without an SSL Inspection Proxy (that is, what normally happens), when you make a request to www.example.com, example.com first sends back the public key of the server certificate for its server to your browser. Your browser uses that public key to encrypt the request for a specific webpage as well as a 'password' of sorts, and sends that back to www.example.com. Then, the server uses its private key to decrypt the request, processes it, and uses that 'password' (called a session key) to send back an encrypted response. That doesn't work so well for an inspection proxy, because this SSL/TLS interchange is designed to thwart any interloper from being able to intercept or see the data transmitted back and forth.
The reason an SSL Inspection Proxy sends a different certificate back is so it can see the request the end-user's browser is making so it knows what to pass on to the actual server as it injects itself as a proxy to this interchange. Otherwise, once the request came to the proxy, the proxy could not read it, because the proxy wouldn't have www.example.com's private key. So, instead, it generates a public/private key and makes it appear like it is www.example.com's server certificate so it can act on its behalf, and then uses the actual public key of the real server certificate to broker the request on.
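To make the key substitution concrete, here is a toy sketch of the exchange described above. It uses textbook RSA with tiny primes purely for illustration; real TLS handshakes are considerably more involved, and the function names, primes, and session key value are all assumptions of this example rather than anything a real proxy or browser uses.

```python
from math import gcd


def make_keypair(p, q, e):
    """Build a toy RSA key pair from two small primes (illustration only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    assert gcd(e, phi) == 1, "e must be coprime to phi"
    d = pow(e, -1, phi)  # modular inverse, requires Python 3.8+
    return (n, e), (n, d)  # (public key, private key)


def encrypt(message, public_key):
    n, e = public_key
    return pow(message, e, n)


def decrypt(ciphertext, private_key):
    n, d = private_key
    return pow(ciphertext, d, n)


# Hypothetical key pairs: one for the real server, one generated by the proxy.
server_public, server_private = make_keypair(61, 53, 17)
proxy_public, proxy_private = make_keypair(89, 97, 5)

session_key = 1234  # the 'password' the browser wants to share

# Normal case: the browser encrypts with the server's public key,
# and only the server's private key recovers the session key.
wrapped = encrypt(session_key, server_public)
recovered_by_server = decrypt(wrapped, server_private)

# Intercepted case: the browser was handed the proxy's public key instead,
# so the proxy can read the session key and re-encrypt it for the real server.
wrapped_for_proxy = encrypt(session_key, proxy_public)
seen_by_proxy = decrypt(wrapped_for_proxy, proxy_private)
forwarded = encrypt(seen_by_proxy, server_public)
recovered_after_proxy = decrypt(forwarded, server_private)

print(recovered_by_server == session_key)
print(seen_by_proxy == session_key)
print(recovered_after_proxy == session_key)
```

The point of the sketch is the middle step: the proxy only learns the session key because the browser encrypted it with a key pair the proxy controls.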
Proposal
The reason an SSL Inspection Proxy can even work is because it signs a fake certificate it creates on-the-fly using a CA certificate trusted by the end user's browser. This, sadly, could be a legitimate certificate (called a SubCA certificate), which would allow anyone who purchases a SubCA certificate to create any server certificate they wanted to, and it would appear valid to the end-user's browser. Why? A SubCA certificate is like a regular server certificate, except it can also be used to sign OTHER certificates. Any system that trusts the CA that created and signed the SubCA certificate would also trust any certificate the SubCA signs. Because the SubCA certificate is signed by, let's say, the Diginotar CA, and your web browser is preconfigured to trust that CA, your browser would accept a forged certificate for www.example.com signed by the SubCA. Thankfully, SubCA's are frowned upon and increasingly difficult for any organization to obtain because they do present a real and present danger to the entire certificate-based security ecosystem.
However, as long as the MITM attacker (or your corporate IT department, in the case of an SSL Inspection Proxy scenario) can coerce your browser into trusting the CA used by the proxy, the proxy can create all the false certificates it wants, sign them with the CA certificate your computer was coerced to trust, and most users would never notice the difference. All the same visual elements of a secure connection -- the green coloration, the padlock icon, and any other indicators made by the browser -- would be present. My proposal to thwart this:
Website operators should publish a hash of the public key of their server certificates (the certificate thumbprint) as a DNS record. For DNS top-level domains (TLD's) that are protected with DNSSEC, as long as the DNS record that contains the hash for www.example.com is cryptographically signed, neither the corporate IT department of local clients nor a network operator could forge a certificate without creating a verifiable breach that clients could check for and then warn end users about. Of course, browsers would need to be updated to do this kind of verification in the form of a DNS lookup in conjunction with the TLS handshake, but provided their resolvers checked for an additional certificate thumbprint DNS record anyway, this would be a relatively trivial enhancement to make.
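The client-side check could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the function names are hypothetical, SHA-256 with hex encoding is my choice of hash and format, the byte strings are placeholders rather than real certificates or public keys, and the DNS lookup and DNSSEC validation that would deliver the published value are left out entirely.

```python
import hashlib
import hmac


def thumbprint(cert_bytes: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as lowercase hex."""
    return hashlib.sha256(cert_bytes).hexdigest()


def matches_published_thumbprint(presented: bytes, published_hex: str) -> bool:
    """Compare the presented certificate's thumbprint with the published one."""
    normalized = published_hex.strip().lower()
    # compare_digest avoids leaking where the two strings differ via timing.
    return hmac.compare_digest(thumbprint(presented), normalized)


# Placeholder byte strings standing in for real certificate data.
genuine_cert = b"placeholder bytes for the real www.example.com certificate"
forged_cert = b"placeholder bytes for a certificate minted by the proxy"

# What the site operator would (hypothetically) publish in a signed DNS record.
published = thumbprint(genuine_cert)

print(matches_published_thumbprint(genuine_cert, published))  # genuine certificate
print(matches_published_thumbprint(forged_cert, published))   # proxy's substitute
```

A browser doing this would warn the user whenever the comparison fails, since a proxy-generated certificate cannot hash to the value the operator published.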
EDIT: (April 15, 2013): There is in fact an IETF working group now addressing this proposal, very close to my original proposal! Check out the work of the DNS-based Authentication of Named Entities (DANE) group here: http://datatracker.ietf.org/wg/dane/ -- on February 25, they published a working draft of this proposed resolution as the new "TLSA" record. Great minds think alike. :)
CNN Lies to Every One of Its Web Viewers
When is it okay to flat out lie to your users? I would argue: Never. But the website of one of the world’s most watched sources of news, CNN, does just that.
Near the bottom of every article is a section called “We recommend” and “From around the web”. These sections list about six links to other articles either on CNN itself, other Turner properties, or simply as a paid referral service for selected partners. So what’s my beef with this? It’s not the targeted marketing, it’s the outright lie I noticed they make when you hover over any of those links with your mouse.
For some background, I’m a huge dissident against outbound link tracking. It’s fundamentally the same as gluing a GPS tracking device to your forehead and giving a tracking device to the website you’re visiting. I have a problem with it because I think there is a fundamental freedom that is eroded by this technology - the freedom to consume information without being tracked for doing so. Do I have the right to pick up a magazine and browse through it without giving someone my telephone number? I would say yes – I think it is a natural right to be able to consume information without having your consumption observed.
But my belief here isn’t realistic – tracking basic visitor behavior and consumer preferences is the basic monetization and sustainability model for most of the Web as we know it. So, this world doesn’t mesh with my perfect world, but at least I should know if someone is observing my behavior, right? Observing CNN’s privacy policy one can clearly see the word “link” is referenced twice, once in relation to third-party sites that may cookie you, and once for integration to social media or other partner sites that may have differing privacy policies.
Okay, fair enough. Therefore, if I am surfing just CNN’s website, if I disable cookies, and if I turn on my Do Not Track header, I should expect not to be tracked, right? No, and the reason is that I cannot tell when I’m still on the CNN site in order to stay within it: CNN has specifically coded its site to lie to me about whether I’m staying within it or navigating away. For example, if I hover over one of the links in these two sections, I see the following in my browser status bar:
www.cnn.com/2012/07/15/sport/jason-kidd-arrested/index.html
I right-clicked the link in Chrome and copied the URL. Then, curiously, I noticed the link read differently in the browser status bar when hovering over it, this time reading:
[traffic.outbrain.com/network/r...](http://traffic.outbrain.com/network/redir?key=ad68e2a0a57f3eb04e4553bf2e80b6b2&rdid=349349184&type=MVLVS_d/t1_ch&in-site=false&req_id=968ab83e0a0f44e584d8744520d2aea0&agent=blog_JS_rec&recMode=4&reqType=1&wid=100&imgType=0&refPub=0&prs=true&scp=false&version=59070&idx=3)
Youch, what's that, and why did it change? On closer inspection, by viewing the source of the page, I can see the target href of the link is exactly as reproduced above, going to traffic.outbrain.com. I peeked at some other URL's in the same section that I had not yet left-clicked or right-clicked and noticed this:
<a target="_self" href="http://www.cnn.com/2012/07/15/sport/jason-kidd-arrested/index.html" onmousedown="this.href='http://traffic.outbrain.com/network/redir?key=10b8398e7c07227c8a8786b1682f1707&rdid=349349184&type=WMV_d/t1_ch&in-site=false&req_id=968ab83e0a0f44e584d8744520d2aea0&agent=blog_JS_rec&recMode=4&reqType=1&wid=100&imgType=0&refPub=0&prs=true&scp=false&version=59070&idx=4';return true;" onclick="javascript:return(true)">Knicks' Jason Kidd arrested on suspicion of DWI</a>
And herein is the deception -- this piece of inline JavaScript code changes the target of the link at the moment it is clicked to go to the traffic.outbrain.com address. Because the target href originally points to the final destination of the article, hovering over it gives the false impression that my click will take me directly to it. Instead, at the moment I click it, the target href is changed to the potentially unscrupulous third party, and I have been given no browser notification this would happen prior to my click. Upon responding, traffic.outbrain.com redirects me back to the original CNN article I initially wanted to view. On a broadband connection, you probably wouldn't even notice the superfluous page load and redirect back to CNN's site. Deceptive!
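If you want to check a page for this pattern yourself, a rough detector can be sketched with Python's standard library. This is only an illustration: the class name, the regular expression, and the sample markup (with example.com and example.net placeholder URLs) are my own assumptions, and it only catches the simple inline `this.href='...'` form shown above, not rewriting done from external scripts.

```python
import re
from html.parser import HTMLParser

# Matches inline handlers that assign a new literal URL to this.href.
HREF_REWRITE = re.compile(r"this\.href\s*=\s*['\"]([^'\"]+)['\"]")


class RewrittenLinkFinder(HTMLParser):
    """Collect links whose inline handlers swap the href at click time."""

    def __init__(self):
        super().__init__()
        self.findings = []  # list of (href shown on hover, href after click)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        shown = attributes.get("href")
        for event in ("onmousedown", "onclick"):
            handler = attributes.get(event) or ""
            match = HREF_REWRITE.search(handler)
            if match and match.group(1) != shown:
                self.findings.append((shown, match.group(1)))


# Hypothetical markup modeled on the link structure described above.
sample = (
    '<a href="http://www.example.com/article.html" '
    "onmousedown=\"this.href='http://tracker.example.net/redir?id=1';return true;\">"
    "Story</a>"
    '<a href="http://www.example.com/honest.html">Honest link</a>'
)

finder = RewrittenLinkFinder()
finder.feed(sample)
for shown, actual in finder.findings:
    print(f"hover shows {shown} but click goes to {actual}")
```

Only the first link in the sample is reported, since the second has no handler that rewrites its destination.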
So, sure, why should anyone care? Isn’t this just plumbing, technology, and the toolbox of tricks inherent to the Web? Maybe, but the problem here is the lie. You do not lie to your users. Ever. Outbound web tracking is not a web beacon. Web beacons are a different kind of “evil” - usually some JavaScript that opens an IFRAME to a third-party site that issues a cookie to track you; however, web beacons are covered by CNN’s privacy policy, so if they were equivalent, it’s all fair. Web beacons can be simply disabled by turning off third-party cookies in today’s browsers. This is precisely why outbound link tracking is becoming popular - it circumvents the privacy management tools most users have available and have knowledge of. Outbound link tracking is no more insidious than web beacons are, but its implementation often lies to the end user about what their action (a click in this case) will do. An honest implementation would either clearly state in the privacy policy that any links you click may be link tracked, or simply point the href at the link tracking site from the start, instead of rewriting it the moment the user clicks, so the browser status bar is truthful on hover (Twitter’s t.co strategy).
Well, at least it’s just CNN at fault here. At least no one else would stoop to such shady tactics. Surely not Google (/url) or Facebook (l.php).. no, definitely not…

