RSS

Alkami: A Retrospective

What a wild and crazy journey the last five years have been.

When I started this blog in 2009, it was shortly after I had inked a deal with an angel investor and journeyed down the road with him and my other co-founder and established Alkami Technology.  Against significant odds, this October marks the five year anniversary of a roller-coaster ride on up, which galvanized Alkami as the clear leader in the online banking space.  Before jumping into this endeavor, I was no stranger to walking products from idealization to realization or running enterprise services in a SaaS model.  But doing all that against the tremendous downside risks of the start-up world, as the new kid on the block among a world of established, very-well funded competitors has been challenging. Actually, it’s been brutal.

Reflecting on the past sixty months, I’ve started to pull together my notes from the early days, both before and after founding Alkami, and I will be commemorating this milestone with a series of blog posts on some company history – the why and how, as well as some valuable and hard-learned lessons along the way.  No one, no company finds tremendous success spontaneously.  While a Inc 500 splash piece on a company might portray success like a serendipitous fairy tale, only through a voracious appetite for risk, an iron stomach for failure, and a committed and skilled team does any great company find its footing.  It’s a great feeling to walk into the office every week and see new, fantastic talent we’ve added to our team and forward-leaning designs and concepts in our flagship solution.  It’s also a very satisfying one to know your personal efforts and sacrifices made that team and that company possible.

This series of posts will not be a beating of the chest or self-congratulatory account of our accolades.  Our work is far from over, and I judge success on a much longer time horizon.  But it will be a real account of our origin story, entrepreneurship, missteps and course correction, and moving from start-up to scale-out in a slow sales cycle, highly-regulated industry.  It’s one thing to have a hip product idea you incubate through an accelerator and debut on a demo day. It’s a very different thing to bootstrap a firm and an entire platform where you have to answer a few hundred RFP questions to get a prospect to even talk with you, many other steps to get just one sale, and many sales to get that kind of investor attention.

Those pieces are now in place and solidifying every day as we take an aggressive product and technical vision to its successful conclusion.  I’m honored to have found great working partners, worked (and still mostly continue to work) with some of the most committed and skilled people across a variety of disciplines along the way.  As we look back in retrospect at five formative years, I’m eager to chronicle our story and to add others who will extend and craft our bright future. Stay tuned.

 
Leave a comment

Posted by on October 1, 2014 in Uncategorized

 

Security Advisory for Financial Institutions: Shell Shock

“Shell Shock” Remote Code Execution and Compromise Vulnerability

Yesterday evening, DHS National Cyber Security Division/US-CERT published CVE-2014-6271 and CVE-2014-7169, outlining a serious vulnerability in a widely used command line interface (or shell) for the Linux operating system and many other *nix variants.  This software bug in the Bash shell allows files to be written on remote devices or remote code to be executed on remote systems by unauthenticated, unauthorized malicious users.  Because the vulnerability involves the Bash shell, some media outlets are referring to this vulnerability as Shell Shock.

Nature of Risk

By exploiting this parsing bug in the Bash shell, other software on a vulnerable system, including operating system components, can be compromised, including the OpenSSH server process and the Apache web server process. Because this attack vector allows an attacker to potentially compromise any element of a vulnerable system, effects from website defacement to password collection, malware distribution, and retrieval of protected system components such as private keys stored on servers are possible, and the US-CERT team has rated this it’s highest impact CVSS rating of 10.0.

Please be specifically aware that a patch was provided to close the issue for the original CVE-2014-6271; however, this patch did not sufficiently close the vulnerability.  The current iteration of the vulnerability is CVE-2014-7169, and any patches applied to resolve the issue should specifically state they close the issue for CVE-2014-7169.  Any devices that are vulnerable and exposed to any untrusted network, such as a vendor-accessible extranet or the public Internet should be considered suspect and isolated and reviewed by a security team due to the ability for “worms”, or automated infect-and-spread scripts that exploit this vulnerability, to quickly affect vulnerable systems in an unattended manner.  Any affected devices that contain private keys should have those keys treated as compromised and have those keys reissued per your company’s information security policies regarding key management procedures.

Next Steps

All financial institutions should immediately review their own environments to determine that no other third-party systems that are involved in serving or securing the online banking experience, or any other publicly-available services, are running vulnerable versions of the Bash shell.  Any financial institution that provides any secure services with Linux or *nix variants running a vulnerable version of the Bash shell could be at risk no matter what their vendor mix. If any vulnerable devices are found, they should be treated as suspect and isolated per your incident response procedures until they are validated as not affected or remediated.  All financial institutions should immediately and thoroughly review their systems and be prepared to change passwords on and revoke and reissue certificates with private key components stored on any compromised devices.

For further reading on this issue:

 
Leave a comment

Posted by on September 25, 2014 in Security

 

End-User Credential Security

This week’s announcement that a Russian crime syndicate has amassed 1.2 billion unique usernames and passwords across 420,000 websites would seem like startling news in 72-point font on the front of major newspapers, if it wasn’t sad it was such a commonplace announcement these days.  With four more months to go and still higher than the estimated 823 million compromised credentials part of 2013 breaches affecting Adobe to Target, it’s from Black Hat 2014 I find myself thinking about what we as ISV’s, SaaS providers, and security professionals can do to protect users in the wake of advanced persistent threats and organized, well-funded thieves wreaking havoc on the digital identities and real assets of our clients and customers.

Unlike Heartbleed or other server-side vulnerabilities, this particular credential siphoning technique obviously targeted users themselves to affect so many sites and at least 542 unique addresses affecting at least half that many unique users.  Why are users so vulnerable to credential-stealing malware?  To explore this issue, let’s immediately discard a tired refrain inside software houses everywhere: users aren’t dumb.  All too often, good application security is watered down to its least secure but most useful denominator for an overabundance of concern that secure applications may frustrate users, lower adoption, and reduce retention and usage.  While it is true that the more accessible the Internet becomes, the wider the spectrum the audience that uses it, from the most expertly capable to the ‘last-mile’ of great grandparents, young children, and the technologically unsophisticated.  However, this should neither be grounds to dismiss end-user credential security as a concern squarely in service provider’s court to address nor should it be an excuse to fail to provide adequately secure systems.  End-user education is our mutual responsibility, even if that means three more screens, additional prompts to confirm identity or action, or an out-of-band verification process.  Keeping processes as stupefying simple as possible because our SEO metrics show that’s the way to marginally improve adoption, reduce cart abandonment, or improve site usage times breeds complacency that ends up hurting us all in the long-run.

Can we agree that 1FA needs to end?  In an isolated world of controlled systems, a username and password combination might have been a fair assertion of identity.  Today’s systems, however, are neither controlled or isolated – the same tablets that log into online banking also run Fruit Ninja for our children, and we pass them over without switching out any concept of identity on a device that can save our passwords and represent them without any authentication.  Small-business laptops often run without real-time malware scanning software, easily harvesting credentials through keystroke logging, MitM attacks, cookie stealing, and a variety of other commonplace techniques.  Username and passwords fail us because they can be saved and cached just as easily as they can be collected and forwarded to command and control servers is Russia or elsewhere.  I’m not one of those anarchists advocating ‘death to the password’ (remember Vidoop?), but using knowledge-based challenges (password, out-of-wallet questions, or otherwise) as the sole factor of authentication needs to end.  And it needs to end smartly: sending an e-mail ‘out of band’ to an inbox loaded in another tab on the same machine, or an SMS message read by Google Voice in another tab means your ‘2FA’ is really just one factor layered twice instead of two-factor authentication.  A few more calls into the call center to help users cope with 2FA will be far cheaper in the long-run than the fallout of a major credential breach that affects your sites users.

We need to also discourage poor password management: allowing users to choose short or non-complex passwords and warning them about their poor choices is no excuse – we should just flatly reject them.  At the same time, we need to recognize that forcing users to establish too complex of a password will encourage them to establish a small number of complex passwords and reuse them across more sites.  This is one of the largest Achilles’s Heels for end-users: when a compromise of one site does occur, and especially if you have removed the option for users to establish a username not tied to their identity (name, e-mail address, or otherwise), you have made it tremendously easier for those who have gathered credentials from one site to have a much higher likelihood of exploiting them on your site.  Instead, we should consider nuances to each of complexity requirements that would make it likely a user would have to generate a different knowledge-based credential for each site.  While that in of itself may increase the chance a user would ‘write a password down’, a user who stores all their passwords in a password manager is still arguably more secure than the user who users one password for all websites and never writes it anywhere.

Finally, when lists of affected user accounts become available in uploaded databases of raw credentials that are leaked or testable on sites such as https://haveibeenpwned.com/ – ACT.  Find out your users that have overlap with compromised credentials on other sites, and proactively flag or lock their accounts or at least message to them to educate and encourage good end-user credential security.  We cannot unilaterally force users to improve the security of their credentials, but we can educate them, and we can make certain their eventual folly through our inaction.

 
 

When to Ride the Service Bus

One of the great things about adding new, senior talent to a storied team working on a large, complex, and successful enterprise application solution is the critical technical review that results in a lot of “why did/didn’t you do it this way?” questions.  You have two options for responding to those questions – ignore or passively dismissing them, or taking the questions seriously as a challenge to prove out why you would make a decision you and your team made 5 years ago the same if you had to consider for the first time today, in today’s frameworks, development methodologies, and the current team makeup and skills inventory.  If you choose to dismiss these opportunities to critically review your prior decisions, it says a lot about your management style, general appreciation of technology and response to its change, and positions your team to take a reactionary, defensive posture to architecture rather than create a team that honors a proactive, continuous improvement perspective.  Far more interesting too are those questions that ask why the system is architected in a general way, rather than a theological debate on whether a particular technology component choice is superior to all over or one’s preferred/familiar choice.

The particular question the new engineer asked was, “Why aren’t we using a service bus?”  Instead of answering him directly, I figured this as a good opportunity to explore the previous decision we made that not only did not include an enterprise service bus (ESB) in the original design, but rejected its inclusion when it was strongly suggested by our first customer because they were standardizing on a service bus-centric architecture themselves.  The primary advantage of a service bus is to layer an abstraction across heterogeneous systems by implementing a centralized communication mechanism across components.  By applying this architectural model, you can get some key benefits including orchestration, queuing to handle intermittent component availability, and extensibility points for message routing to alter dispatch logic or transform messages.  Implementing the service bus pattern requires some kind of adapter to be written for each component of the system, either as a local modification to each component or by choosing to standardize on a communication channel provided by the ESB.  Even in the latter, usually some minor accommodation is required to allow the ESB to receive and encapsulate the native message for delivery to the destination component. Our first customer was a notable player in the community banking market, and was productizing multiple new SaaS-based web applications that depended on data feeds coming from many different customers.  In their scenarios, data was consumed by one application, parsed, and delivered to other applications, which in turn may have created additional data feeds for other products, in a cyclic communication/dependency non-directed graph.  Each application was developed by different teams, and there was no unified technology stack adoption – some teams were developing on EJB and Flex, others were pure .NET, and teams generally had the discretion to choose whatever they could argue would solve the job, without a strong technology leader looking to unify the stack for similar applications that delivered CMS and pseudo-online banking functionality using a common input data set.

For this customer, ESB was a solution to a problem – their choices lead to a highly concurrent development process with multiple independent teams – but also supported connecting a heterogeneous environment of interdependent components, each of which accomplished limited objectives.  This organization was running red-hot – developing ancillary products to a highly engaged and fanatic client base of community banks, where their limiting factor was their speed of innovation and delivery.  By agreeing on a common communication mechanism that ESB could provide, there was something, albeit low-level, to which all teams agreed.  In the ‘controlled agile chaos’ they found themselves in, the abstraction bought them flexibility to adapt changing business requirements using orchestration.  In theory anyway – they ended up moving much slower than they anticipated, but this wasn’t the fault of ESB. ESB solves two classes of problems.  The first is the common use case of large, disparate enterprises looking to marry systems established from the dawn of client-server architectures to the newest Node.js hotness, without having to bend the will of any particular system to the communication conventions of any other, which may prove impossible if both systems are proprietary.  This is a common use case for BizTalk, especially in the financial sector.  All the other benefits you can name off from a service bus architecture are really secondary advantages to this key objective.  The second is the use case that any layer of indirection provides: an abstraction you can use to increase the speed of development when requirements are incomplete or prone to pivot.  In each case, you invest in a layer to reduce the cost of future change. This particular customer chose NServiceBus as their message-oriented middleware.  We seriously evaluated both the general architectural concepts ESB as well as the particular technology they suggested and came up with a definitive ‘no’ to that choice.  While it made a lot of sense for our customer, it did not make sense for us because:

  1. We did not require guaranteed event handling.  Our system connected to a system of record that provided transactional consistency, and virtually all state changes were initiated by users through a web browser.  A timeout was preferable to queued command handling system because of the possibility of duplicate transactions that frustrated users may initiate, not realizing their requests were queued.  Second, our interconnected systems did not provide guaranteed event handling, so the guaranteed provided by the ESB would now be honored end-to-end.  Third, we are using the Windows Identity Foundation with sliding time expirations end-to-end from the user’s browser through the lowest layer of service components, which doesn’t bode well for delayed delivery situations, even if the user was willing to wait.
  2. We do require transformation, but not orchestration between our components.  Our system features adapter-based design to allow multiple types of endpoints to be serviced by a single service implementation for those portions that may need to connect to a different type of third-party system through a provider model implementation loaded by dependency injection.  We could have chosen to use ESB for this piece, however, we perceived the long-term maintenance cost of multiple providers with the party-specific transformation logic to be lower than maintaining those transforms in ESB scripting or adapters.  In reviewing this perception today, I believe it was still the right decision because is allowed for us to unit-test our transformation logic without including the ESB.
  3. An ESB is a single point of failure that would independently need to scale for load exponentially proportional to the number of service interconnects in our solution, and would add some amount of latency between each. Because online banking is a mission-critical, customer-facing solution, it cannot have SPOF’s in any portion of the architectural design.  The SPOF nature of an ESB can be mitigated in multiple ways, but we felt that was at least two layers of complexity we could solve in other, simpler ways.
  4. All middleware increases the Mean Time Between Failures (MTBF).  This is not a risk specific to ESB, but of any layer added to a system.  If you add an ORM, IOC, ESB, or even a logging aspect, something can go wrong with them.  Each component has some small, but measurable failure rate, and when inserted into the communication chain between all components, its reliability of 99.999% still contributes to a reduction in the overall reliability of a serial system.  This is where the KISS principle shines – complexity creates unreliability, so all complexity must generate a compelling benefit in excess of its potential to fail.
  5. We wanted our application layer to be the platform, we did not want ESB to be the platform.  This was a business case / competitive advantage decision that we wanted to build as a feature of our system that the same services layer that supported our front-end user interfaces was also an open and extensible platform upon which our clients could integrate to, which would increase the overall value proposition of online banking not only as a sticky end-user experience, but also as a value proposition to capitalize on our solution as the middleware that marries together all the disparate systems within a financial institution, which ultimately online banking does like no other piece of technology within a bank or credit union.  We felt that by positioning everything behind an ESB, the perceived value of our technology piece would be lessened without additional client education.
  6. MSMQ made us feel dirty enough, and we did not want to mandate it for each component because it was in 2009 and still is relatively difficult to debug, and lately we have learned, queues do not work well with used with Layer 7 network load balancing.  The new hotness of 0MQ wasn’t around then, and while RabbitMQ was, it was arguably not production ready by that time.  For us, production-ready isn’t just whether a component is capable, but whether it will have general acceptance from the IT departments of our large clients – many newer technologies that are FOSS or from vendors without an establish track record require a ‘sale’ and buy-in during due diligence, long before ink is applied to a contract.  Even if they were options for the ESB queuing mechanism, they would not resolve the larger aforementioned concerns.
  7. At the time we made this choice, AMQP was an amorphous draft that did not solidify until later.  The lack of a vendor-independent protocol between components and an ESB made the choice to utilize an ESB subject to vendor lock-in, which we were not willing to tolerate for such a critical component.
  8. Because our product was both the end-user experience and the middleware we were writing, we felt strongly that the application protocol should provide descriptive metadata and support fast client proxy generation using .NET-based tools.  REST support was archaic at best (HttpRequest anyone?) in .NET 3.5, and to this day, consuming SOAP services is intrinsically more verbose in C# and VB.NET (HttpClient) than consuming REST or AMQP services due to a lack of better library and integrated language support for it.  Looking back on this, with a large amount of iterative change we went through from ideation to Version 1.0 of our solution, we could not have moved as fast without a fast way to regenerate proxies that would cause build failures to alert us of service operation signature changes — tracking these down at runtime (REST) or having to debug a secondary system (ESB) to find these would have bogged down our delivery timelines.
  9. A lesser concern was we felt that tracing SOAP messages, while definitely more difficult than REST, would be more difficult debug any issues in AMQP or other ESB encapsulation protocols than inspecting SOAP envelopes with built-in WCF tools already present in the .NET development stack.

So, that’s quite a case against an ESB, but they do have compelling uses for certain environments – just not ours.  Like all technology selection decisions, it’s important to pick the right tool for the job, and improve your tools as needed.  A standalone ESB would have provided significant benefits if we were developing with proprietary/closed third-party systems that were part of a call chain that required orchestration, or if we were developing with a heterogeneous mix of technologies.  In our case, we had a predictable homogeneous .NET environment based on web services, consumers of our API are our own technologies or a limited number of customers who were also using .NET, and we had no legacy baggage.  With the widespread adoption of WS-* standards, we have chosen to obtain some of the benefits, such as federation, from those standards rather than an ESB feature, which ultimately we believe makes our platform easier to support and distribute for our future API consumers.  Other side benefits such as logging are handled as separated concerns through dependency injection rather than external interceptors in a communication channel, a possibility for us only because we control the portion of the stack that requires orchestration.  And finally, by keeping all communication as SOAP over HTTP/HTTPS, we gain features like load balancing from Layer 7 network devices instead of an ESB process, which are much easier to switch out and upgrade.

The central design decision we made was that ESB’s provide some great features and that ties you into an ESB, but if we could get those features another way that was just as convenient or more so, we’d prefer the plug-and-play flexibility of leveraging existing solutions for components such as caching and load balancing in the environment our solution operates, or pick those pieces ad-hoc for those concerns rather than pick the best omnibus solution and work around any specific shortcomings for any one of them. In reviewing the current industry literature and blog posts and looking at general trends, it would seem our decision not to marry our solution is generally the path many take when not required to integrate legacy systems as part of an orchestration chain or when using non-HTTP based transport mechanisms.  If you’re using one, hopefully it’s for a good and necessary reason!  For us, though, we decided not to hop on a service bus that could take us somewhere we already arrived.

* As an aside, we actually did end up rolling our own small “ESB” as a TCP port multiplexer that queues and portions out connectivity to a socket-based, legacy third-party component that has no listener back-queue and no port concurrency, highly unusual for a server process.  Each connection consumes the port fully for the duration of the short transaction, and we had to write a way to buffer M number of requests and hand them off to (M-N) number of available ports as they became available,in a specialized type of producer-consumer problem. In hindsight, this was an opportunity to use an ESB, but in our case, we only required message routing and load leveling, and in a few hundred lines of code, we implemented what we needed for this particular third-party system what would have taken us far longer to do as our first time using an ESB. That being said, should we encounter this with another vendor, it would make sense to review using an ESB for this type of functionality in the future.
 
Leave a comment

Posted by on April 1, 2014 in Programming

 

Tags: , ,

The Wires Cannot Be Trusted; Does DRM Have Something to Teach Us?

In the continuing revelations about the depth to which governments have gone to subjugate global communications in terms of privacy, anonymity, and security on the Internet, one thing is very clear: nothing can be trusted anymore.

Before you wipe this post off as smacking of ‘conspiracy theorist’, take the Snowden revelations disclosed since Christmas, particularly regarding the NSA’s Tailored Access Operations catalog that demonstrates the ways they can violate implicit trust in local hardware by infecting firmware at a level where even reboots and factory ‘resets’ cannot remove the implanted malware, or their “interdiction” of new computers that allow them to install spyware between the time it leaves the factory and arrives at your house.  At a broader level, because of the trend in global data movement towards centralizing data transit through a diminishing number of top tier carriers – a trend is eerily similar to wealth inequality in the digital era – governments and pseudo-governmental bodies have found it trivial to exact control with quantum insert attacks.  In these sophisticated attacks, malicious entities (which I define for these purposes as those who exploit trust to gain illicit access to a protected system) like the NSA or GCHQ can slipstream rogue servers that mimic trusted public systems such as LinkedIn to gain passwords and assume identities through ephemeral information gathering to attack other systems.

Considering these things, the troubling realization is this is not the failure of the NSA, the GCHQ, the US presidential administration, or the lack of public outrage to demand change.  The failure is in the infrastructure of the Internet itself.  If anything, these violations of trust simply showcase technical flaws we have chosen not to acknowledge to this point in the larger system’s architecture.  Endpoint encryption technologies like SSL became supplanted by forward versions of TLS because of underlying flaws not only in cipher strength, but in protocol assumptions that did not acknowledge all the ways in which the trust of a system or the interconnects between systems could be violated.  This is similarly true for BGP, which has seen a number of attacks that allow routers on the Internet to be reprogrammed to shunt traffic to malicious entities that can intercept it: a protocol that trusts anything is vulnerable because nothing can be trusted forever.

When I state nothing can be trusted, I mean absolutely nothing.  Your phone company definitely can’t be trusted – they’ve already been shown to have collapsed to government pressure to give up the keys to their part of the kingdom.  The very wires leading into your house can’t be trusted, they could already or someday will be tapped.  Your air-gapped laptop can’t be trusted, it’s being hacked with radio waves.

But, individual, private citizens are facing a challenge Hollywood has for years – how do we protect our content?  The entertainment industry has been chided for years on its sometimes Draconian attempts to limit use and restrict access to data by implementing encryption and hardware standards that run counter to the kind of free access analog storage mediums, like the VHS and cassette tapes of days of old, provided.  Perhaps there are lessons to be learned from their attempts to address the problem of “everything, everybody, and every device is malicious, but we want to talk to everything, everybody, on every device”.  One place to draw inspiration is HDCP, a protocol most people except hardcore AV enthusiasts are unaware of that establishes device authentication and encryption across each connection of an HD entertainment system.  Who would have thought when your six year old watches Monsters, Inc., those colorful characters are protected by such an advanced scheme on the cord that just runs from your Blu-ray player to your TV?

While you may not believe in DRM for your DVD’s from a philosophical or fair-use rights perspective, consider the striking difference with this approach:  in the OSI model, encryption occurs at Layer 6, on top of many other layers in the system.  This is an implicit trust of all layers below it, and this is the assumption violated in the headlines from the Guardian and NY Times that have captured our attention the most lately: on the Internet, he who controls the media layers also controls the host layers.  In the HDCP model, the encryption happens more akin to Layer 2, as the protocol expects someone’s going to splice a wire to try to bootleg HBO from their neighbor or illicitly pirate high-quality DVD’s.  Today if I gained access to a server closet in a corporate office, there is nothing technologically preventing me from splicing myself into a network connection and copying every packet on the connection.  The data that is encrypted on Layer 6 will be very difficult for me to make sense of, but there will be plenty of data that is not encrypted that I can use for nefarious purposes: ARP broadcasts, SIP metadata, DNS replies, and all that insecure HTTP or poorly-secured HTTPS traffic.  Even worse, it’s a jumping off point for setting up a MITM attack, such as an SSL Inspection Proxy.  Similarly, without media-layer security, savvy attackers with physical access to a server closet or the ability to coerce or hack into the next hop in the network path can go undetected if they redirect your traffic into rogue servers or into malicious networks, and because there is no chained endpoint authentication mechanism on the media-layer, there’s no way for you to know.

These concerns aren’t just theoretical either, and they’re not to protect teenagers’ rights to anonymously author provocative and mildly threatening anarchist manifestos.  They’re to protect your identity, your money, your family, and your security.  Only more will be accessible and controllable on the Internet going forward, and without appropriate protections in place, it won’t just be governments soon who can utilize the assumptions of trust in the Internet’s architecture and implementation for ill, but idealist hacker cabals, organized crime rings, and eventually, anyone with the right script kiddie program to exploit the vulnerabilities once better known and unaddressed.

Why aren’t we protecting financial information or credit card numbers with media-layer security so they’are at least as safe as Mickey Mouse on your HDTV?

 

Tags: , , ,

Scaling Enterprise Database-Bound Applications: I/O

Optimizing Slow Accesses

While most software developers like to think of themselves as computer scientists in the purest sense of the term, with job duties that would include intimately understanding and exploiting efficiencies of the x64 processor platform, optimizing that critical-path O(log n) algorithm to perform in O(log log n) time, and other acts of mathematical creativity and scientific application, that’s not what most software developers do (or should be doing if they are).

Most software developers are building business (retail B2C, B2B API’s, or LOB‘s), not scientific applications — and that means most are developing I/O-bound, not CPU-bound applications.  Specifically, most business applications are creative user or application programming interfaces around relatively mundane CRUD operations on a data store.  Even more complex applications that perform data synchronization or novel calculations of co-variance or multivariate regression consume maybe 5% of their time crunching data, and the other 95% of the time retrieving and sending it on.

So, when you design an enterprise application and get past the ideation phase and start scaling out your next-generation game-changing application from a cute demo to a serious and robust application serving millions of requests, why would you bother with refactoring your string concatenation in loops into string builders, aiming for zero-copy, or optimizing for CPU performance?  You should not and you should:  You should not be optimizing for CPU performance, unless you have optimized all your slow accesses away — and you should be optimizing for CPU performance because hopefully you’ve already squeezed all the blood out of the I/O turnip you can.

But you haven’t.  I know you haven’t.  You know you haven’t if you are being honest.  Have you ever looked at your database queries per second for specific-entity queries?  For instance, let’s say a user logs into your enterprise application, and a service on your application tier needs to retrieve the record of a user.  That service might call another service to make a record of the user’s login.  Then the user navigates to another page in your application 60 seconds later.  How many times did any component of your system retrieve the user by their unique identifier?  If the answer is, “I don’t know”, you haven’t scratched the surface of scaling an enterprise application, much less my most important axiom of doing so: “Don’t Repeat Requests“.

This is a lot harder than you might think, because enterprise web application development lends itself to repeating requests, and it is not an easy problem to solve, because you are essentially creating state on an application tier for a web tier that hosts a stateless HTTP application protocol.  When functionality is segregated into multiple services with distinct responsibilities, there is some duplication of I/O access that happens to fulfill a request that is unavoidable.  Unless you and everyone on your team completely understands this disjoint and works collectively to design solutions that do not repeat requests, you will repeat requests as part of the natural design of any system.

Caching Isn’t a Magic Bullet, But It Is a Bullet

If you thought this post was going to end at “implement second-level caching on your ORM of choice”, you’re wrong, but you should be doing that for sure.  This is usually as easy installing a caching server like Couchbase, configuring your ORM in a few lines of code or configuration files, and wala – you are still repeating your requests, but this time, answering your repeated requests will be a lot faster than any SSD-backed database server will ever be.

(I say ‘usually’, because this depends on how you’re using your ORM.  If you use your ORM as an expensive way to execute stored procedures, your ORM will be at best a pass-through for database methods and will not give you the benefit of entity caching that could be reused for multiple queries that include that entity as a result.  As with all caching, YMMV depending on how you have designed your layers.)

Once you enable caching, measure.  Measure how many times you ask for that user record when a user logs in and performs some actions over time.  You’ll be amazed that when you view this from a database request level, you will still be asking for the same user over and over again as long as not every component is using the cache for database entities with a consistent cache key.  It’s very hard to get right, both from an application configuration and a caching server configuration perspective — do not assume, but do measure.

Remember, the most important thing to remember is not to get really fast answers to your repeated questions, but stop asking the same questions over and over again!  Caching at the ORM is your tourniquet to stop the bleeding of your performance into database I/O buffers and wait times, but caching at the inter-component request level is critical.  Let’s say you have an enterprise web application that retrieves a forecast for a city for a given period of time.  The web client makes the request for the locale and date range to your application tier, which translates that into queries of whatever entities comprise your data model.  With ORM second-level caching in effect, the next request for the same locale and date range will not ask the question of the database this time, but the answer will come instead from the second-level cache… but stop right there.  The question was asked again at a higher level, you’re just answering it in a more intelligent way the second time around.

Enterprise web applications need to cache the responses of service requests using a cache key that accounts for the parameters of the request.  Hopefully your web application faithfully implements a repository pattern, and if so, you implement a cache into this layer to eliminate repeated requests to the service layer to start with. This is not easy.  This is hard because your ORM’s database caching is likely a black box implementation of complex cache expiration logic that performs all sorts of clever tricks to know when an entity has become ‘dirty’ and needs to be retrieved again from the underlying database rather than use the cached copy.  If you’re developing business applications, you’re probably not accustomed to being clever at this level, and you will need to spend the time to implement this manually throughout your repository pattern (unless you thought ahead and can add caching as an aspect) and to bust your caches.

Challenges of Busting Caches

Busting your own caches – that is, invalidating a cached entry when you have reason to know the cached version is no longer good – is one of the trickiest things to get right in this stage of Don’t Repeat Requests.  Let’s take a service method called GetUser() that returns the user and an object graph of some interesting things that cover multiple data entities from the database.  At the web tier, we start caching that call when we make it so subsequent calls from the web tier won’t even request this from the service while its in cache.  But what else could change the User object in the database?  If the user themselves can, then that’s easy enough to know to bust a cache on a User repository .Save() method, but if other unrelated processes can, such as say, a back-end service process that bulk-updates users for some reason, then this gets more challenging to ensure you’ve identified all the paths that could invalidate the data and make sure each have access to bust the cache for the GetUser() response as cached by the web tier as well as the User entity as represented in any other request (think GetUser(), GetUsersByWhatever(), and all the other variants that may also need cache busting).  When GetUser() actually includes data sourced from other entities, you have to think about the dependent object graph in the data model and ensure you’ve accounted for these as well.  You just have to consider but not handle this recursive analysis for deep object graphs — it only matters as much as it matters for the user experience.

This kind of task must be reserved for the architects and most senior engineers who know your system design and inter-dependencies inside and out to avoid data consistency errors.  A key point is as long as all data validation logic is performed at the lowest layer under any custom caching work you perform, data consistency errors will at worst create a poor user experience.  If you don’t – if you have critical client-side validation that is not mirrored under caching on the service-side of your architecture, you have bigger security risks and other problems than caching, but this will definitely impede your ability to deploy service request caching and scale your application.

Caching From Within

Within any area of your application, beware anti-patterns that repository patterns can create.  If you author MethodA() that calls MethodB() that calls MethodC(), all of which individually call UserRepository.GetUser(), then you’re recursively repeating yourself.  Repository patterns are nice because they reduce the repetitive session and connection management functions involved with making a web service or database call, but they make it easy to forget that they’re very, very heavy methods.

Do not be afraid to accumulate.  Do not be afraid to pass object graphs through method parameters to save I/O.  You could think about the call stack as your cache here, and while you shouldn’t load it up as an unnecessarily heavy omnibus object to pass around to any method, and while you definitely should not front-load all your I/O before calling a logical method chain before conditional logic or exception management could make some of the calls unnecessary, intelligently design methods so they don’t take the smallest parameter set possible, but create the best scalability when working in concert.

Caching Outside Your Boundaries

If you’re writing enterprise web applications for a product that is not dying or decaying, you’re writing it in HTML5 today.  And if your web design isn’t from a Frontpage 98 template, you’re probably using AJAX requests either to improve user experiences and reduce perceived page load times or maybe you’ve gone whole-hog into an SPA design.  With HTML5 and a relatively modern web browser, you have LocalStorage.  Use LocalStorage.

You should be using LocalStorage to cache and bust non-error responses to AJAX requests to your web services and REST endpoints. Just because you’ve thinned out the pipes from services to the database and from the web tier to the services tier, why stop there?  Why continue to allow browsers to repeat requests to your web tier as a user moves back and forth between areas or pages?  If you rest on your laurels on a job-well-done, but still repeat unnecessary I/O queries at a level higher up in the chain, then you’ve made your application more performant but not truly scalable — you’ve just shifted the blame.

The F5 Test

I propose what I will call the “F5 Test” for scalability.  When you’ve cached all you can cache, and every layer is implementing the “Don’t Repeat Requests” mantra, open up your database profiler and your Couchbase cache hit dashboard.  Log into your application’s dashboard, reporting, or whatever page you want to test, then clear your profiler and cache hit counters.  Press F5.  You should see very, very little activity on a reload, and you should be able to explain what you do see.

But, for what you do see, justify each and don’t make excuses for yourself:

  1. If your dashboard makes repeated requests because you feel it “always needs to be up-to-date”, then you’re doing it wrong.  Cache and use server-side events to refresh your cached copy.
  2. If you load a user object to determine whether they have a login session, then do you have a good reason for not using browser evidence such as a signed SAML assertion to validate a session instead of using a database lookup to verify a user exists and is authorized?
  3. If you see something you can’t explain, investigate.  I wish this was as obvious as it is intuitive, but many times software developers will be content with an arbitrary improvement (I made 232 database calls on login go down to 47) rather than to do the homework to find out why 47 isn’t 5.  Maybe there are 42 extraneous requests made by a service that doesn’t use the cache even though you thought it did.  Maybe one of those 42 requests causes database locking escalations that won’t scale with load.

Optimizing Query Plans

Oh yeah, and optimize query plans.  This is important work, but it’s not the outer-most layer of the onion.  It’s important to remember the difference between scalability and performance:

  • Performance should be determined by the user experience from dispatching of the request to final rendering of the result to the user in their browser.  Performance is not “how much CPU does the system use under load” – that is resource utilization, though many people use performance for both concepts.
  • Scalability is two-fold: How many users can I get a certain level performance on a certain hardware basline (scaling up), and can I and how often will I have to throw money at more hardware to handle more users at the same level of performance (scaling out)?
  • Improving performance may or may not improve scalability
  • Improving scalability rarely improves performance
  • Management will not understand the difference

Optimizing query plans can impact both: improving a query plan from 6 seconds to 1 second improves performance.  It could improve scalability if your queries are over complex joins or large data sets that couldn’t be pinned in memory automagically in your database server.  But optimizing query plans for speed alone is not a function of scalability — optimizing them for I/O is where it’s at.  Simple improvements like changing JOIN’s to EXISTS’s where feasible allow the query engine to skip unnecessary I/O is what opens up buffers and improves throughput through the disk subsystems where the big performance and scalability penalties hit.  It just so happens complex queries that have I/O in intermediate steps also have high CPU due to hash matching, rewinds, and other operations that perform calculations on large amounts of data generated from unnecessary I/O.

It’s work you should do, but you shouldn’t do it first for scalability reasons.

After-Thoughts: Don’t Report Stupid Results

Building highly-scalable applications from the ground up with a large team is impossible.  You iterate scalability just as you iterate product features.  Actually, hopefully you iterate scalability tasks along with user stories, but in actuality, complex enterprise web applications are usually architected with the best of intentions with intelligent designs, but reach a breaking point at some level of load on some hardware platform that cause a stop-drop-and-roll effort to improve the scaling up and out of an application.  Companies with deadlines and tight deliverable schedules don’t consistently evaluate and factor the required work to make and keep an application scalable over time into iterations.  If someone tells you differently, they’re probably in sales and they’re definitely lying.

That being said, software developers, do not succumb to the pressure to deliver scalability improvements by reporting true but irrelevant statistics to management.

  • “I sped up database calls for GetUser() by 300%!” suggests anything that gets a user should see a three-fold improvement in speed.  If that database call is 1% of the login process time, then it will have no material impact.
  • “I reduced the size of page requests from 500K to 250K!” means “I doubled the performance or scalability of the application” to management, but in reality, it means neither.
  • “I found a problem between ServiceA and ServiceB and cut out three extraneous calls between them!” means nothing to anyone.  Did you remove three calls that are made once an hour by a batch process, or three calls made for every user login?  What was the impact of those calls on performance and scalability before and after the optimization?
  • “ServiceA is a big problem and has a lot of errors.  I removed a lot of exceptions on ServiceA.  Exceptions cause performance problems.” is problematic on several levels.  Why were the exceptions being thrown?  Did removing them fix or just sweep a real problem under the carpet?  If it was justified, what improvement did it have on the overall system?

When software developers communicate their changes, it implies they have meaningful impact. However, many software developers fail to measure the before and after impact of their changes on the whole system, but typically only evaluate them in the microcosm of the area they changed.  This is about as useful as management suggesting areas they should fix based on intuition or high-level reporting tools.

While most devs don’t do scientific computing, scaling applications is an empirical task that demands meaningful measurement in a realistic testing context.  There is spec document or product owner guidance on improving scalability: you must treat it as a scientific experiment.  Observe, hypothesize, have a control (the pre-change measurement), experiment, report data.  If you fail to discretely value each change with before and after metrics, you’re just shooting in the dark.  Cowboy coding gets teams into scalability messes, not out of them.

Especially, though, don’t give updates on enhancements that you cannot verify improve scalability with before and after numbers.  If you fix a problem that doesn’t improve the overall system scalability, which happens often in scalability improvement iterations, highlighting your accomplishments when there is no observable improvement suggest you are either ineffective or not working on the right items.  Worse, in crunch times, providing such updates gives a false sense of accomplishment to management.  Improving scalability, or performance for that matter, has no done-state.  But providing meaningless accomplishment notes to management will accelerate the sense of “we’re done enough”, when in fact, you may not have even identified the most significant issue to your scalability for your particular scenario.

And if you haven’t, let me do it for you:  You’re repeating your requests.  Trust me on that one. :-)

 
Leave a comment

Posted by on December 4, 2013 in Programming

 

A Brief Introduction to Part-of-Speech Tagging

A field of computer science that has captured my attention lately is computational linguistics — the inexact science of how to get a computer to understand what you mean.  This could be something as futuristic as Matthew Broderick’s battle with the WOPR, or with something more practical, like Siri.  Whether it be text entered by a human into a keyboard or something more akin to understanding the very unstructured format of human speech, understanding the meaning behind parsed words is incredibly complex — and to someone like me — fascinating!

My particular interest as of late is parsing — which from a linguistic perspective, means the breaking down of a string of characters into words, their meanings, and stringing them together in a parse tree, where the meanings of individual words as well as the relationships between words is composed into a logical construct that allows higher order functions, such as a personal assistant.  Having taken several foreign language classes before, then sitting on the other side of the table as an ESL teacher, I can appreciate the enormous ambiguity and complexity of any language, and much more so English among Germanic languages, as to creating an automated process to parse input into meaningful logical representations.  Just being able to discern the meaning of individual words given the multitude of meanings that can be ascribed to any one sequence of characters is quite a challenge.

Parsing Models

Consider this:  My security beat wore me out tonight.

In this sentence, what is the function of the word beat?  Beat functions as either a noun or a verb, but in this context, it is a noun.  There are two general schools of thought around assigning a tag as to what part of speech (POS) each word in a sentence functions as — iterative rules-based methods and stochastic methods.  In rules-based methods, like Eric Brill’s POS tagger, a priority-based set of rules that set forth language-specific axioms, such as “when a word appears to be a preposition, it is actually a noun if the preceding word is while”.  A complex set of these meticulously constructed conditions is used to refine a more course dictionary-style assignment of POS tags.

Stochastic methods, however, are more “fuzzy” methods of building advanced statistical models of how words should be tagged not based on a procedural and manual analysis of edge cases and their mitigations, but using training models over pre-tagged corpra, in a manner hearkening to the training sets applied to neural networks.  These trained models are then used as a baseline for assigning tags to incoming text, but no notable option for correction of any specific error or edge case other than retraining the entire model is available for refinement.  One such very interesting concept is treating the tagging of parts of speech as Hidden Markov Models, which is a probabilistic model that strives to explain how a process with a defined pattern that is not known other than sparse characteristics of the model and the inputs and the outputs through the process.

This continues to be a good candidate for doctorial theses in computer science disciplines.. papers that have caused me to lose too much sleep as of late.

Parsing Syntax

Even describing parts of speech can be as mundane as your elementary school grammar book, or as rich as the C7 tagset, which provides 146 unique ways to describe a word’s potential function.  While exceptionally expressive and specific, I have become rather fond of the Penn Treebank II tagset, which defines 45 tags that seem to provide enough semantic context for the key elements of local pronoun resolution and larger-scale object-entity context mapping.  Finding an extensively tagged Penn Treebank corpus proves difficult, however, as it is copyright by the University of Pennsylvania, distributed through a public-private partnership for several thousand dollars, and the tagged corpus is almost exclusively a narrow variety of topics and sentence structures — Wall Street Journal articles.  Obtaining this is critical to use as a reference check for writing a new Penn Treebank II part-of-speech tagger, and it prevents the construction of a more comprehensive Penn-tagged wordlist, which would be a boon for any tagger implementation.  However, the folks at the NLTK has provided a 10% free sample under Fair Use that has provided somewhat useful for both checking outputs in a limited fashion, but also for generating some more useful relative statistics about relationships between parts of speech within a sentence.

To produce some rudimentary probabilistic models to guide ambiguous POS-mappings for individual words, I wrote a five-minute proof of concept that scanned the NLTK-provided excerpt of the WSJ Penn Treebranch corpus to produce probabilities of what the next word’s part of speech would be given the previous word’s tag. The full results are available in this gist.

Future Musings

My immediate interest, whenever I get some free time on a weekend (which is pretty rare these days due to the exceptional pace of progress at our start-up), is pronoun resolution, which is the object of this generation’s Turing Test — the Winograd Schemas.  An example of such a challenge is to get a machine to answer this kind of question — Joe’s uncle can still beat him at tennis, even though he is 30 years older. Who is older? This kind of question is easy for a human to answer, but very, very hard for a machine to infer because (a) it can’t cheat to Google a suitable answer, which some of the less impressive Turing Test contestant programs now do, and (b) it requires not only the ability to successfully parse a sentence into its respective parts of speech, phrases, and clauses, but it requires the ability for a computer to resolve the meaning of a pronoun.  That’s an insanely tough feat!  Imagine this:

“Annabelle is a mean-spirited person.  She shot my dog out of spite.”

A program could infer “my dog” is a dog belonging to the person providing the text.  This has obvious applications in the real world if you can do this, and it has been done before.  But, imagine the leap in context that is exponentially harder to overcome when resolving “She”.  This requires not only an intra-sentence relationship of noun phrases, possessive pronouns, direct objects, and adverbial clauses, but it also requires the ability to carry context forward from one sentence to the next, building a going “mental map” of people, places, things — and building a profile of them as more information or context is provided.  And, if you think that’s not hard enough to define .. imagine the two additional words appended on to this sentence:

, she said.

That would to a human indicate dialog, which requires a wholly separate frame of Inception-style reference between contextual frames.  The parser is reading text about things which is actually being conveyed by other things — both sets of frames have their own unique, but not necessarily separate, domains and attributes.  I’m a very long-way off from ever getting this diversion in my “free time” anywhere close to functioning as advertised… but, then again, that’s what exercises on a weekend are for — not doing, but learning. :)

 
2 Comments

Posted by on August 22, 2013 in Programming

 
 
Follow

Get every new post delivered to your Inbox.

Join 42 other followers