iDatamining.org

I am looking for projects to work on.
Please contact me at yiyu.jia@iDataMining.org!

Saturday, March 30, 2013

OAuth 1.0 and OAuth 2.0, three-legged authentication and two-legged authentication

It is easy for a Web developer to understand why we need OAuth 1.0. It is a three-legged scenario: 1) the Web browser user, 2) a third-party service provider that wants to access the user's info from 3) the OAuth provider that hosts the user's valuable info.

So, for security reasons, the Web browser user should not simply give a third-party website his/her username and password to fetch info on his/her behalf. Not every website can be trusted, not only because immoral websites may misuse your personal info, but also because sites differ in their capability to secure your privacy. That is why URL redirection is part of the OAuth framework. Also, the user may not want to allow the website to access all of the info hosted at the OAuth service provider's site; therefore OAuth acts as an authorization framework too. Furthermore, for lightweight encryption and signing, HMAC was employed. But OAuth 2.0 drops the signatures and requires the authorization request to be sent over SSL/TLS, through which the secret key is delivered and directly validated. So in OAuth 2.0 there is no requirement on the order of parameters. The benefit is not cost free, however, as the server and client have to spend more resources handling SSL communication.
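To make the signing point concrete, here is a minimal sketch of an OAuth 1.0 HMAC-SHA1 signature using only Python's standard library. The URL, keys, and secrets are made-up values, and a production client must follow the spec's percent-encoding rules exactly; this just shows why parameter ordering matters in OAuth 1.0.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote, urlencode

def sign_oauth1(method, url, params, consumer_secret, token_secret=""):
    # OAuth 1.0 requires parameters to be sorted before signing --
    # this is exactly the ordering requirement that OAuth 2.0 drops.
    base_params = urlencode(sorted(params.items()), quote_via=quote)
    base_string = "&".join(
        quote(p, safe="") for p in (method.upper(), url, base_params)
    )
    key = f"{quote(consumer_secret, safe='')}&{quote(token_secret, safe='')}"
    digest = hmac.new(key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

# Made-up credentials, for illustration only:
signature = sign_oauth1(
    "GET", "https://api.example.com/photos",
    {"oauth_consumer_key": "key", "oauth_nonce": "abc123",
     "oauth_timestamp": "1364668800", "oauth_signature_method": "HMAC-SHA1"},
    consumer_secret="secret",
)
```

Because the signature covers the sorted parameter string, any reordering or tampering on the wire invalidates it, which is what lets OAuth 1.0 work without SSL.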

To be usable in native apps, such as mobile apps, OAuth 2.0 standardizes an extension for two-legged authorization, which I think has a security problem. You may think that a server, or the user's own server, could be trusted to hold the user's credential info. But it is highly risky to implement a two-legged flow in a mobile app, which may be used on a large number of devices by many different people. Here is an article about OAuth 2.0 and the road to hell. But in the case where users want to build their own application to get info through an OAuth service provider, it does not matter to use the password flow, and to put this flow under the name of OAuth (OAuth 2.0, actually).
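At the wire level, the password flow is just a form-encoded POST to the provider's token endpoint. A sketch of that request body follows; the endpoint URL and all credentials are placeholders, and the real request must go over TLS. Note how the user's actual password ends up embedded in the client, which is the risk discussed above.

```python
from urllib.parse import urlencode

# Resource Owner Password Credentials grant (the "password flow").
# Endpoint and all credential values below are placeholders.
token_endpoint = "https://login.example.com/oauth2/token"
body = urlencode({
    "grant_type": "password",
    "client_id": "my-client-id",
    "client_secret": "my-client-secret",
    "username": "user@example.com",   # the user's real credentials travel
    "password": "user-password",      # inside the app -- risky on devices
})                                    # you don't control
# Sent as: POST <token_endpoint> with
# Content-Type: application/x-www-form-urlencoded
```

This is why the flow is reasonable for a user's own trusted tool, but dangerous baked into an app distributed to many devices.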


Below are three different flows from Salesforce as examples. It is important for developers to understand which flow they need to start with, according to their use case. It is interesting to see that they declare their flow is for authentication purposes. What is the difference between authentication and authorization if all resources are exposed once authentication is passed?

OAuth 2.0 User-Agent Flow


OAuth 2.0 Web Server Flow


OAuth 2.0 Username-Password Flow
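To illustrate how these flows differ at the wire level, here is a sketch of the authorization URL a User-Agent (implicit) flow would build. The endpoint path resembles Salesforce's, but the host, client_id, and redirect URI are placeholder values, not real ones.

```python
from urllib.parse import urlencode

# User-Agent (implicit) flow: the browser is redirected to the provider's
# authorize endpoint and the access token comes back in the redirect
# fragment. Host, client_id, and redirect_uri are placeholders.
params = {
    "response_type": "token",   # "code" here would be the Web-server flow
    "client_id": "my-consumer-key",
    "redirect_uri": "https://app.example.com/callback",
}
authorize_url = (
    "https://login.example.com/services/oauth2/authorize?" + urlencode(params)
)
```

The only structural difference from the Web-server flow is `response_type`: `token` returns the access token directly to the browser, while `code` returns an intermediate code that the server exchanges for a token.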
http://stackoverflow.com/questions/7561631/oauth-2-0-benefits-and-use-cases-why
http://hueniverse.com/2012/07/oauth-2-0-and-the-road-to-hell/
http://www.wolfe.id.au/2012/10/20/what-is-hmac-and-why-is-it-useful/

Sunday, March 17, 2013

Anatomy of the Hadoop Map/Reduce process


In the Hadoop MapReduce framework, besides the map and reduce phases, the shuffle phase is actually more important, as lots of optimization work is done during shuffling. The diagram below shows what may happen in the shuffle step. At each step there is a chance to speed up or slow down processing.


The sub-steps above happen in the spill thread. The spill thread reads Map output from a buffer (100MB by default) and performs these sub-steps depending on whether the corresponding functions are defined. Eventually, all the small spill files are merged into one big file as the Map output.
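These spill-time sub-steps can be sketched as a toy model in Python. This is only an illustration of the idea (partition, then sort by key within each partition, then optionally combine), not Hadoop's actual Java internals.

```python
from collections import defaultdict

def spill(records, num_partitions, combiner=None):
    """Toy model of one spill: partition -> sort by key -> optional combine."""
    partitions = defaultdict(list)
    for key, value in records:
        # Partition decides which reducer will get each record.
        partitions[hash(key) % num_partitions].append((key, value))
    spill_file = {}
    for p, pairs in partitions.items():
        pairs.sort(key=lambda kv: kv[0])   # sort within partition by key
        if combiner:                       # combiner acts as a local reducer
            grouped = defaultdict(list)
            for k, v in pairs:
                grouped[k].append(v)
            pairs = sorted((k, combiner(vs)) for k, vs in grouped.items())
        spill_file[p] = pairs
    return spill_file

# Word-count style map output, combined locally before hitting "disk":
out = spill([("a", 1), ("b", 1), ("a", 1)], num_partitions=2, combiner=sum)
```

Running a combiner here is exactly the optimization opportunity mentioned above: less data is written to disk and later shipped to the reducers.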


http://blog.sina.com.cn/s/blog_62186b4601012fty.html
Shuffle refers to the process that begins when the Map produces output, including the system sorting the Map output and transferring it to the Reducer as input. Let's start from the Map side. When the Map starts producing output, it does not simply write the data to disk, because frequent disk operations would severely degrade performance. It first writes the data into an in-memory buffer and does some pre-sorting to improve efficiency. This buffer is 100MB by default. When the amount of data in the buffer reaches a certain threshold (0.80 by default), the system spills the buffer's contents to disk. During the spill, Map output continues to be written into the buffer, but if the buffer fills up, the Map is blocked until the spill completes. Before writing buffered data to disk, the spill thread first sorts it by the partition the data belongs to (the partition determines which Reduce will process each portion of the data), and then sorts the data within each partition by key. If a Combiner is set, it runs on the sorted output. The Combiner is a local Reducer that makes the Map output more compact, so less data is written to disk and transferred to the Reducer. A new spill file is produced each time the in-memory data reaches the spill threshold, so by the time the Map task writes its last output record there may be several spill files. Before the Map task finishes, all the spill files are merged into one partitioned and sorted output file. Once the spill files are merged, the Map deletes all temporary spill files and notifies the TaskTracker that the task is complete. The Map output file is placed on the local disk of the TaskTracker running the Map task; it is the input needed by the TaskTracker running the Reduce task. Reducers copy their corresponding input data over HTTP. When all the Map outputs have been copied, the Reduce task enters the sort phase (more accurately, the merge phase, since sorting was already done on the Map side), which merge-sorts all the Map outputs. Then, in the Reduce phase, the Reduce function is applied to every key/value pair of the sorted output. The output of this phase is written directly to the output filesystem, usually HDFS. That completes the analysis of MapReduce Shuffle and Sort.
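The final merge of spill files described above is essentially a k-way merge of already-sorted runs; Python's heapq.merge sketches the idea with toy data (this is not Hadoop's actual merge code):

```python
import heapq

# Each spill file is already sorted by key; merging them yields one
# sorted map output without re-sorting everything from scratch.
spill_1 = [("apple", 1), ("cat", 1)]
spill_2 = [("apple", 1), ("bird", 1)]
spill_3 = [("bird", 1), ("cat", 1)]

merged = list(heapq.merge(spill_1, spill_2, spill_3))
# Equal keys now arrive adjacent, ready for a reducer to iterate over.
```

The same k-way merge happens again on the Reduce side, which is why that stage is better called a merge phase than a sort phase.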

Wednesday, March 13, 2013

Blocking Third-Party Cookies may not be such a big deal for the online advertising industry

It has been announced that Firefox will block third-party cookies by default, as Safari has done for years. (See the blog posts Firefox getting smarter about third-party cookies and The New Firefox Cookie Policy.)

This is big news in the online advertising industry, and there are many reports on the Internet. As the post "Imagining a World Without Cookies" points out, there will be winners and losers if there are no third-party cookies. I pretty much agree with their points. But before publishers start cheering for this, they need to think seriously about what their solution will be to replace the third-party cookies from the giants of the industry. Obviously, publishers would need to be able to build a solution to collect, store, analyze, secure, and share user behavior data.

But what could happen if publishers are not able to implement such a solution? I think there are at least two ways to go. Publishers can either license a solution from a third-party software vendor or host JavaScript code distributed by the traditional cookie collectors. I will call this JavaScript code a cookie proxy.
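The server side of such a cookie proxy can be sketched with Python's http.cookies module: because the publisher's own endpoint issues the Set-Cookie header, the browser treats the cookie as first-party even though the value originates with the tracker. The domain, cookie name, and tracker ID here are all made up for illustration.

```python
from http.cookies import SimpleCookie

def make_first_party_cookie(tracker_id, publisher_domain):
    """Issue a tracker-supplied value as a first-party cookie."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = tracker_id                 # value supplied by the tracker
    cookie["visitor_id"]["domain"] = publisher_domain # publisher's own domain
    cookie["visitor_id"]["path"] = "/"                # -> browser sees first-party
    return cookie.output(header="Set-Cookie:")

header = make_first_party_cookie("abc-123", "news.example.com")
```

The tracker's JavaScript would call a proxy API hosted on the publisher's domain, and the publisher's server would respond with a header like the one built here.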


Figure 1. JavaScript proxy converts third-party cookies to first-party cookies


As shown in Figure 1, a current third-party cookie can easily be converted into a first-party cookie. The differences/questions are:
  1. Publishers need to host the third-party-cookie JavaScript proxy code on their domain. The proxy code exposes an API that traditional third-party cookie generators call to generate cookies.
  2. In other words, it is now the publisher's right to decide whether third-party JavaScript is allowed to be hosted on their domain.
  3. One interesting question is: if hosting third-party JavaScript code is allowed, who will take responsibility for guaranteeing its safety? We can see that, normally, third-party cookie generators use deep, dynamic JavaScript loading, which makes the code harder to trace.
  4. The resources a Web browser allocates to a given domain are limited. So how much do these resources cost the traditional third-party cookie generators? Will they pay for them?
  5. Some experiments need to be done in the future to figure out whether sub-domains can work around the limit on resources allocated for cookies. According to the RFC, this may work.
Anyway, from the technical side, this third-party cookie issue belongs to the realm of cross-domain scripting. I listed several ways to do cross-domain scripting in the post "seven ways for cross domain script programming". I don't know yet how Firefox will handle the techniques listed there when blocking third-party cookies by default; it will be interesting to explore. But one thing should be sure: JavaScript loaded from a third-party domain will still be allowed. Otherwise, the WWW distributed computing model would be damaged and something wonderful would be stopped, for example CDNs.

This event clearly shows that our Web is still evolving quickly. The W3C obviously has more and more work to do after defining HTML5. For example, one thing I observed and posted under the title "A war to occupy Internet user's Web browser" could be interesting stuff to solve at the RFC level.