Very excited to announce that I have been selected to speak at Black Hat USA 2015, being held in Las Vegas this year. My talk is titled ‘Securing Your Big Data Environment’. Come join me in the South Seas CDF room at Mandalay Bay between 16:20 and 17:10.
Summary of the talk: Hadoop and big data are no longer buzz words in large enterprises. Whether for the right reasons or not, enterprise data warehouses are moving to Hadoop, and along with them come petabytes of data. How do you ensure that big data in Hadoop does not become a big problem or a big target? Vendors pitch their technologies as the magical silver bullet. However, did you realize that some controls depend on how many mappers are available in the production cluster? What about the structure of the data being loaded? How much overhead does the decryption operation add? If you are tokenizing data, how do you distinguish between tokenized and original production data? In certain ways, though, Hadoop and big data represent a greenfield opportunity for security practitioners. This is a chance to get ahead of the curve and to test and deploy your tools, processes, patterns, and techniques before big data becomes a big problem.
Come join this session, where we walk through the control frameworks we built and what we discovered, reinvented, polished, and developed to support data security, compliance, cryptographic protection, and effective risk management for sensitive data.
Unlike the loss of a physical device, if an attacker breaks into your corporate network you still have your data after they steal a copy of it, so the theft is easy to miss. It is more important than ever to detect whether your company has been broken into. This article identifies a number of indicators of compromise on a corporate network. It is not an exhaustive list, and I will keep adding to it, along with recommended security measures you can take to detect and prevent activity that could lead to a compromise of your network by attackers.
Logging: When you log, you can detect and identify unusual activity on your network and on your endpoints.
Monitor log file line counts and line lengths. At a minimum, establish an average baseline of your log file size and trigger alerts when the volume of events for a given day spikes or, worse, drops (a minimal sketch of this appears after this list).
Look for spikes in specific traffic types (e.g. SSH, FTP, DNS) and baseline the number of events, including bandwidth usage
Look at the country of origin of IP connections (overall or by protocol)
Scan for the software/tools listed in “List of Publicly Available Tools used for Attacks” below. This includes scanning for otherwise benign network utilities such as SysInternals and PsTools, which are not flagged as malicious by antivirus products but are handy tools for an attacker.
Scan for RDP session artifacts in HKCU\Software\Microsoft\Windows\Shell\BagMRU and related keys
Scan for remote access services – VNC, RDP
Scan for open remote access ports (TCP 3389 for RDP, TCP 5900 for VNC); see the port-scan sketch after this list
Scan for batch files and scripts
Scan for multiple archive files – ZIPs and RARs including encrypted compressed files
Scan for RAR/ZIP compression artifacts in page files and unallocated space
Scan for programs run in the AppCompatCache
Scan for sysadmin tools executed such as tlist.exe, local.exe, kill.exe
Scan for files in the root of C:\RECYCLER
Scan for anomalies such as an abnormal source location or logon time (for example, after 7pm EST) and apply other time-of-use rules and baselines
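To make the logging recommendation above concrete, here is a minimal sketch in Python of how you might baseline daily log volume and alert when today's line count deviates sharply in either direction. The file names, historical counts, and three-standard-deviation threshold are all made up for illustration; a sudden drop is treated as just as suspicious as a spike, since attackers often disable or wipe logs.

import statistics

def baseline(counts):
    """Return the mean and standard deviation of historical daily line counts."""
    return statistics.mean(counts), statistics.pstdev(counts)

def check_today(today_count, history, tolerance=3.0):
    """Alert if today's log line count is more than `tolerance` deviations from baseline."""
    mean, stdev = baseline(history)
    if stdev == 0:
        stdev = max(mean * 0.1, 1.0)  # avoid division by zero on a perfectly flat baseline
    deviation = abs(today_count - mean) / stdev
    if deviation > tolerance:
        direction = "drop" if today_count < mean else "spike"
        print(f"ALERT: log volume {direction}: {today_count} lines vs baseline of {mean:.0f}")
    else:
        print(f"OK: {today_count} lines is within {tolerance} standard deviations of baseline")

# Hypothetical line counts for the last 14 days of an auth log
history = [10250, 9980, 10310, 10120, 9875, 10400, 10050,
           10210, 9950, 10330, 10080, 10150, 9990, 10270]
check_today(2150, history)   # sharp drop -> alert
check_today(10190, history)  # normal day -> OK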
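Along the same lines, here is a rough sketch, again in Python and with a hypothetical host list, of sweeping address space you are authorized to scan for hosts listening on the common remote access ports, so that unexpected RDP or VNC services stand out.

import socket

# Common remote access ports: 3389 is the RDP default, 5900 the VNC default
REMOTE_ACCESS_PORTS = {3389: "RDP", 5900: "VNC"}

def scan_host(host, timeout=1.0):
    """Return a list of findings for remote access ports open on the given host."""
    findings = []
    for port, service in REMOTE_ACCESS_PORTS.items():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            if sock.connect_ex((host, port)) == 0:  # 0 means the TCP connect succeeded
                findings.append(f"{host}:{port} ({service}) is open")
    return findings

if __name__ == "__main__":
    # Replace with address ranges you actually own and are authorized to scan
    for host in ["192.168.1.10", "192.168.1.11"]:
        for finding in scan_host(host):
            print(finding)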
Why use a CDN?
Tracking Performance with HTTP Archive
To track the speed of the web over time, Google built the HTTP Archive as an open-source service and has since transitioned its ownership and maintenance to the Internet Archive. It is a permanent repository of web performance information such as page size, requests made, and technologies utilized. Its list of URLs is based solely on the Alexa Top 1,000,000 sites. As of March 2012, a total of 77082 sites had been analyzed; this count is expected to ramp up to cover the full top 1 million websites on the Internet soon.
Google Libraries API CDN
Running an analysis of the HTTP Archive for usage of the Google Libraries API, we get a count of 14345 of the 77082 sites tested, or about 18.61%, which is quite impressive. However, this count does not account for the different libraries and their versions. That distinction matters because sites must reference exactly the same CDN URL, whether from Google, Microsoft, or any other CDN, to obtain the cross-site caching benefit.
In case you were curious, the Microsoft CDN is used by only 157 of the sites tested, or 0.2% of the most popular websites on the Internet, compared to the 18.61% that leverage the Google Libraries API CDN.
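If you want to double-check those percentages, the arithmetic is simply each site count divided by the 77082 sites analyzed:

total_sites = 77082
print(f"Google Libraries API CDN: {14345 / total_sites:.2%}")  # roughly 18.61%
print(f"Microsoft CDN: {157 / total_sites:.2%}")               # roughly 0.20%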
Test 1: Validate the Google APIs reference
First, we want to verify the 18.61% figure. The count of 14345 sites comes from any HTTP request containing ‘googleapis.com’ anywhere in the URL. The query used to extract this statistic was:
SELECT COUNT(DISTINCT pageid)
FROM requests  -- the requests table in the HTTP Archive MySQL dump
WHERE url LIKE '%googleapis.com%';
Test 2: Determine the percentage of pages that ran a jQuery version
Next, we want to determine the percentage of pages that were using at least some version of jQuery from the Google Libraries API, out of the sites in the HTTP Archive. Running the following SQL query gives a result of 11145, or 14.45% of the total sites analyzed.
SELECT COUNT(DISTINCT pageid)
FROM requests
WHERE url LIKE 'http%://ajax.googleapis.com/ajax/libs/jquery/%';
Test 3: Determine jQuery versions
Next, we want the breakdown for each distinct URL that referenced jQuery. This is the statistic that determines the cross-site caching benefit. Running the following query gives us the breakdown below:
SELECT url, COUNT(DISTINCT pageid) AS count
FROM requests
WHERE url LIKE 'http%://ajax.googleapis.com/ajax/libs/jquery/%'
GROUP BY url
ORDER BY count DESC;
The results prove that the fragmentation issues are very real. The most popular URL for loading jQuery was jQuery 1.4.2 via http, which constitutes only 2.19% (1695 out of 77082) of all the websites tested. The next most popular is jQuery 1.3.2 with 1.48% (1142 out of 77082), and so on. It is not just fragmentation across the different libraries and versions, but also across protocols (e.g. http vs. https), since those are cached separately.
Another frustration is the “latest version” reference, which allows you to request version 1 or 1.x and automatically receive the latest 1.x.y version. You can see the full output of the above query posted on GitHub if you are interested.
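If you would rather roll those results up yourself, the following sketch aggregates the per-URL counts by jQuery version and protocol to quantify the fragmentation. It assumes, purely for illustration, that you exported the Test 3 output to a file named jquery_urls.csv with url and count columns.

import csv
import re
from collections import Counter

# Matches URLs such as http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js
pattern = re.compile(r"(https?)://ajax\.googleapis\.com/ajax/libs/jquery/([^/]+)/")
by_version = Counter()

with open("jquery_urls.csv", newline="") as f:  # hypothetical export of the Test 3 query
    for row in csv.DictReader(f):
        match = pattern.match(row["url"])
        if match:
            protocol, version = match.groups()
            by_version[(version, protocol)] += int(row["count"])

for (version, protocol), count in by_version.most_common():
    print(f"jQuery {version} over {protocol}: {count} sites")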
Conclusion & Recommendations
Make sure to use the Expires header so responses are cacheable; for your repeat visitors, it then doesn’t matter where the file was originally served from (see the sketch after these recommendations)
Users whose privacy settings clear the cache between browser sessions (e.g. Firefox’s privacy.clearOnShutdown.cache option) cannot realize the assumed benefits of a CDN distribution
Another point of note from Steve Souders is that the amount of disk space browsers allot for caching has not caught up with people’s usage of the web. For example, IE has a default cache size of 8 – 50 MB, Firefox 50 MB, Opera 20 MB, and Chrome ~80 MB; mobile phones have even smaller limits. The cache will eventually fill up, and when that happens the FIFO (first-in, first-out) rule applies and cached resources make way for new ones.
If a website uses the https reference to a JS library, browsers usually default to not caching those resources to disk when they are retrieved over SSL.
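As an illustration of the first recommendation above, here is a minimal sketch that uses Python’s built-in http.server to serve a static resource with a far-future Expires header and a matching Cache-Control header; it is a toy example, not a production configuration.

import time
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

ONE_YEAR = 365 * 24 * 60 * 60  # seconds

class CachingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"/* static JS library contents would go here */"
        self.send_response(200)
        self.send_header("Content-Type", "application/javascript")
        # Expires one year from now (RFC 1123 date); Cache-Control covers HTTP/1.1 clients
        self.send_header("Expires", formatdate(time.time() + ONE_YEAR, usegmt=True))
        self.send_header("Cache-Control", f"public, max-age={ONE_YEAR}")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CachingHandler).serve_forever()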
We live in a global village of interconnected systems that share data and other services. Such an environment calls for heightened awareness around application security. Enterprises should establish a strong application security program and integrate security into the entire software development lifecycle including the design, development, verification, and maintenance processes.
The following is an excellent infographic from Veracode that talks about application security and where the vulnerabilities lie.